Abstract

We observe $(X_i, Y_i)_{i=1}^n$ where the $Y_i$'s are real-valued outputs and the
$X_i$'s are $m \times T$ matrices. We observe a new entry $X$ and we want to predict
the output $Y$ associated with it. We focus on the high-dimensional setting, where
$mT \gg n$. This includes the matrix completion problem with noise, as well as other
problems. We consider linear prediction procedures based on different penalizations,
involving a mixture of several norms: the nuclear norm, the Frobenius norm and
the $\ell_1$-norm. For these procedures, we prove sharp oracle inequalities, using a
statistical learning theory point of view. A surprising fact in our results is that
the rates of convergence do not depend on $m$ and $T$ directly. The analysis is
conducted without the usually considered incoherency condition on the unknown
matrix or restricted isometry condition on the sampling operator. Moreover, our
results are the first to give for this problem an analysis of penalization (such as
nuclear norm penalization) as a regularization algorithm: our oracle inequalities
prove that these procedures have a prediction accuracy close to the deterministic
oracle one, given that the regularization parameters are well-chosen.

Keywords. High dimensional matrix ; Matrix completion ; Oracle inequalities ; 
Schatten norms ; Nuclear norm ; Empirical risk minimization ; Empirical process 
theory ; Sparsity 

1 Introduction 

1.1 The model and some basic definitions 

Let $(X, Y)$ and $D_n = (X_i, Y_i)_{i=1}^n$ be $n + 1$ i.i.d. random variables with
values in $\mathcal{M}_{m,T} \times \mathbb{R}$, where $\mathcal{M}_{m,T}$ is the set of
matrices with $m$ rows and $T$ columns with entries in $\mathbb{R}$. Based on the
observations $D_n$, we aim to predict the real-valued output $Y$ by a linear transform
of the input variable $X$. We focus on the high-dimensional setting, where $mT \gg n$.
We use a "statistical learning theory point of view": we do not assume that
$\mathbb{E}(Y|X)$ has a particular structure, such as
$\mathbb{E}(Y|X) = \langle X, A_0 \rangle$ for some $A_0 \in \mathcal{M}_{m,T}$, where
$\langle \cdot, \cdot \rangle$ is the standard Euclidean inner product given for any
$A, B \in \mathcal{M}_{m,T}$ by

$$\langle A, B \rangle := \operatorname{tr}(A^\top B). \tag{1}$$

The statistical performance of a linear predictor $\langle X, A \rangle$ for some
$A \in \mathcal{M}_{m,T}$ is measured by the quadratic risk

$$R(A) := \mathbb{E}[(Y - \langle X, A \rangle)^2]. \tag{2}$$

If $\hat A_n \in \mathcal{M}_{m,T}$ is a statistic constructed from the observations
$D_n$, then its risk is given by the conditional expectation

$$R(\hat A_n) := \mathbb{E}[(Y - \langle X, \hat A_n \rangle)^2 \,|\, D_n].$$

A natural candidate for the prediction of $Y$ using $D_n$ is the empirical risk
minimization procedure, namely any element in $\mathcal{M}_{m,T}$ minimizing the
empirical risk $R_n(\cdot)$ defined for all $A \in \mathcal{M}_{m,T}$ by

$$R_n(A) = \frac{1}{n} \sum_{i=1}^n (Y_i - \langle X_i, A \rangle)^2.$$
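The definitions above can be sketched in a few lines of code. This is only an illustration: matrices are stored as nested Python lists, and the toy sample and the candidate matrix `A` are hypothetical values chosen for the example, not data from the paper.

```python
def inner(A, B):
    """Trace inner product <A, B> = tr(A^T B) = sum of entrywise products."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def empirical_risk(A, sample):
    """R_n(A) = (1/n) * sum_i (Y_i - <X_i, A>)^2 for sample = [(X_i, Y_i), ...]."""
    return sum((y - inner(x, A)) ** 2 for x, y in sample) / len(sample)


# Hypothetical toy data: two 2 x 2 inputs with their real-valued outputs.
sample = [
    ([[1.0, 0.0], [0.0, 0.0]], 2.0),
    ([[0.0, 1.0], [1.0, 0.0]], 1.0),
]
A = [[1.0, 0.5], [0.5, 3.0]]  # hypothetical candidate matrix
risk = empirical_risk(A, sample)
```

Here the first observation contributes a squared residual of 1 and the second a residual of 0, so the empirical risk of this particular `A` is 1/2.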

It is well-known that the excess risk of this procedure is of order $mT/n$. In the
high-dimensional setting, this rate does not go to zero. So, if
$X \mapsto \langle A_0, X \rangle$ is the best linear prediction of $Y$ by $X$, we need
to know more about $A_0$ in order to construct algorithms with a small risk. In
particular, we need to know that $A_0$ has a "low-dimensional structure". In this
setup, this is usually done by assuming that $A_0$ is low rank. A first idea is then
to minimize $R_n$ and to penalize matrices with a large rank. Namely, one can consider

$$\hat A_n \in \operatorname*{argmin}_{A \in \mathcal{M}_{m,T}} \{R_n(A) + \lambda \operatorname{rank}(A)\}, \tag{3}$$

for some regularization parameter $\lambda > 0$. But $A \mapsto \operatorname{rank}(A)$
is far from being a convex function, thus minimizing (3) is very difficult in
practice, see [19] for instance on this problem. Convex relaxation of (3) leads to the
following convex minimization problem

$$\hat A_n \in \operatorname*{argmin}_{A \in \mathcal{M}_{m,T}} \{R_n(A) + \lambda \|A\|_{S_1}\}, \tag{4}$$

where $\|\cdot\|_{S_1}$ is the 1-Schatten norm, also known as nuclear norm or trace norm.
This comes from the fact that the nuclear norm is the convex envelope of the rank
on the unit ball of the spectral norm, see [18]. For any matrix
$A \in \mathcal{M}_{m,T}$, we denote by $s_1(A), \dots, s_{\operatorname{rank}(A)}(A)$
its nonincreasing sequence of singular values. For every $p \in [1, \infty)$, the
$p$-Schatten norm of $A$ is given by

$$\|A\|_{S_p} := \Big( \sum_{j=1}^{\operatorname{rank}(A)} s_j(A)^p \Big)^{1/p}, \tag{5}$$

with the usual convention $\|A\|_{S_\infty} := s_1(A)$ for $p = \infty$. In particular,
the $\|\cdot\|_{S_\infty}$-norm is the operator norm or spectral norm. The
$\|\cdot\|_{S_2}$-norm is the Frobenius norm, which satisfies

$$\|A\|_{S_2}^2 = \sum_{i,j} A_{i,j}^2 = \langle A, A \rangle.$$
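As a quick sanity check of the Schatten norms just defined, here is a worked example on a hypothetical $2 \times 3$ matrix. The matrix is not taken from the paper; it is chosen so that its singular values can be read off directly.

```latex
\begin{align*}
A &= \begin{pmatrix} 3 & 0 & 0 \\ 0 & 4 & 0 \end{pmatrix},
  \qquad s_1(A) = 4, \quad s_2(A) = 3, \quad \operatorname{rank}(A) = 2, \\
\|A\|_{S_1} &= 4 + 3 = 7, \qquad
\|A\|_{S_2} = \sqrt{4^2 + 3^2} = 5, \qquad
\|A\|_{S_\infty} = 4, \\
\|A\|_{S_2}^2 &= 25 = 3^2 + 4^2 = \sum_{i,j} A_{i,j}^2 = \langle A, A \rangle .
\end{align*}
```

The ordering $\|A\|_{S_\infty} \le \|A\|_{S_2} \le \|A\|_{S_1}$ seen here holds for every matrix, since the Schatten norms are the $\ell_p$ norms of the same vector of singular values.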



1.2 Motivations 

A particular case of the matrix prediction problem described in Section 1.1 is the
problem of (noisy) matrix completion, see [38, 39], which became very popular because
of the buzz surrounding the Netflix prize (see footnote 1). In this problem, it is
assumed that $X$ is uniformly distributed over the set
$\{e_{p,q} : 1 \le p \le m, 1 \le q \le T\}$, where $e_{p,q} \in \mathcal{M}_{m,T}$ is
such that $(e_{p,q})_{i,j} = 0$ when $i \ne p$ or $j \ne q$ and $(e_{p,q})_{p,q} = 1$.
If $\mathbb{E}(Y|X) = \langle A_0, X \rangle$ for some $A_0 \in \mathcal{M}_{m,T}$, then
the $Y_i$ are $n$ noisy observations of the entries of $A_0$, and the aim is to
denoise the observed entries and to fill the non-observed ones.
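The sampling scheme of noisy matrix completion can be simulated as follows. This is a minimal sketch under stated assumptions: the Gaussian noise, the target matrix `A0`, the sample size and all function names are hypothetical choices made for illustration (the text only requires $X$ to be uniform over the matrices $e_{p,q}$), and indices are 0-based in the code.

```python
import random


def basis_matrix(p, q, m, T):
    """e_{p,q}: the m x T matrix with a single 1 at position (p, q), 0-indexed."""
    return [[1.0 if (i, j) == (p, q) else 0.0 for j in range(T)] for i in range(m)]


def inner(A, B):
    """Trace inner product <A, B> = sum of entrywise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def sample_completion(A0, n, noise_std, rng):
    """Draw n pairs (X_i, Y_i) with X_i uniform over {e_{p,q}} and
    Y_i = <X_i, A0> + Gaussian noise, i.e. a noisy copy of one entry of A0."""
    m, T = len(A0), len(A0[0])
    data = []
    for _ in range(n):
        p, q = rng.randrange(m), rng.randrange(T)
        X = basis_matrix(p, q, m, T)
        data.append((X, inner(X, A0) + rng.gauss(0.0, noise_std)))
    return data


rng = random.Random(0)                     # fixed seed, for reproducibility only
A0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # hypothetical 2 x 3 target matrix
data = sample_completion(A0, n=4, noise_std=0.0, rng=rng)
```

With `noise_std=0.0` each output is exactly one entry of `A0`, which is the noiseless completion setting; a positive `noise_std` gives the noisy setting studied here.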

First motivation. Quite surprisingly, for matrix completion without noise
($Y_i = \langle X_i, A_0 \rangle$), it is proved in [15] and [16] (see also [21], [32])
that nuclear norm minimization is able, with a large probability (of order
$1 - m^{-3}$), to recover $A_0$ exactly when $n \ge c\, r (m + T)(\log n)^6$, where $r$
is the rank of $A_0$. This result is proved under a so-called incoherency assumption
on $A_0$. This assumption requires, roughly, that the left and right singular vectors
of $A_0$ are well-spread on the unit sphere. Using this incoherency assumption, [14]
and [23] give results concerning the problem of matrix completion with noise. However,
recalling that this assumption was introduced in order to prove exact completion, and
since in the noisy case it is obvious that exact completion is impossible, a natural
goal is then to obtain results for noisy matrix completion without the incoherency
assumption. This is a first motivation of this work: we derive very general sharp
oracle inequalities without any assumption on $A_0$, not even that it is low-rank.
More than that, we do not need to assume that $\mathbb{E}(Y|X) = \langle X, A_0 \rangle$
for some $A_0$, since we use a statistical learning point of view in the statement of
our results. More precisely, we construct procedures $\hat A_n$ satisfying sharp
oracle inequalities of the form

$$R(\hat A_n) \le \inf_{A \in \mathcal{M}_{m,T}} \{R(A) + r_n(A)\} \tag{6}$$

that hold with a large probability, where $r_n(A)$ is a residue related to the penalty
used in the definition of $\hat A_n$ that we want as small as possible. By "sharp" we
mean that in the right hand side of (6), the constant in front of $R(A)$ is equal to
one.

1 http://www.netflixprize.com/

A surprising fact in our results is that, for penalization procedures that involve
the 1-Schatten norm (and 2-Schatten norm if a mixed penalization is considered), the
residue $r_n(\cdot)$ does not depend on $m$ and $T$ directly: it only depends on the
1-Schatten norm of $A_0$, see Section 2 for details. This was not, as far as we know,
previously noticed in the literature (all the upper bounds obtained for
$\|\hat A_n - A_0\|_{S_2}^2$ depend directly on $m$ and $T$ and on $\|A_0\|_{S_1}$ or on
its rank and on $\|A_0\|_{S_\infty}$, see the references above and below). This fact
can be used to argue that $\|\cdot\|_{S_1}$ is a better measure of sparsity than the
rank, and it points out an interesting difference between nuclear-norm penalization
(also called "Matrix Lasso") and the Lasso for vectors.

In [34], which is a work close to ours, upper bounds for $p$-Schatten penalization
procedures for $0 < p \le 1$ are given in the same setting as ours, including in
particular the matrix completion problem. The results are stated without the
incoherency assumption for matrix completion. But for this problem, the upper bounds
are given using the empirical norm
$\|\hat A_n - A_0\|_n^2 = \sum_{i=1}^n \langle X_i, \hat A_n - A_0 \rangle^2 / n$ only.
An upper bound for this measure of accuracy gives information only about the denoising
part and not about the filling part of the matrix completion problem. Our results have
the form (6), and taking $A_0$ instead of the minimum in this equation gives an upper
bound for $R(\hat A_n) - R(A_0)$, which is equal to $\|\hat A_n - A_0\|_{S_2}^2/(mT)$
in the matrix completion problem when $\mathbb{E}(Y|X) = \langle X, A_0 \rangle$ (see
Section 2).

Second motivation. In the setting considered here, an assumption called Restricted
Isometry (RI) on the sampling operator
$\mathcal{L}(A) = (\langle X_1, A \rangle, \dots, \langle X_n, A \rangle)/\sqrt{n}$ has
been introduced in [33] and used in a series of papers, see [34], [13], [29, 30]. This
assumption is the matrix version of the restricted isometry assumption for vectors
introduced in [12]. Note that in the high-dimensional setting ($mT \gg n$), this
assumption is not satisfied in the matrix completion problem, see [34] for instance,
which works with and without this assumption. The RI assumption is very restrictive
and (up to now) is only satisfied by some special random matrices (cf. [36, 22, 28, 27]
and references therein). This is a second motivation for this work: our results do not
require any RI assumption. Our assumptions on $X$ are very mild, see Section 2, and
are satisfied in the matrix completion problem, as well as in other problems, such as
multi-task learning.

Third motivation. Our results are the first to give an analysis of nuclear-norm
penalization (and of other penalizations as well, see below) as a regularization
algorithm. Indeed, an oracle inequality of the form (6) proves that these penalization
procedures have a prediction accuracy close to the deterministic oracle one, given
that the regularization parameters are well-chosen.

Fourth motivation. We give oracle inequalities for penalization procedures involving
a mixture of several norms: $\|\cdot\|_{S_1}$, $\|\cdot\|_{S_2}^2$ and the
$\ell_1$-norm $\|\cdot\|_1$. As far as we know, no result for penalization using
several norms was previously given in the literature for high-dimensional matrix
prediction.

Procedures based on 1-Schatten norm penalization have been considered by many
authors recently, with applications to multi-task learning and collaborative filtering.
The first studies are probably the ones given in [38, 39], using the hinge loss for
binary classification. In [6], it is proved, under some condition on $A_0$, that
nuclear norm penalization can consistently recover $\operatorname{rank}(A_0)$ when
$n \to +\infty$. Let us recall also the references we mentioned above and other
closely related ones: [18, 33], [13, 11, 15, 14, 16], [24, 23], [34], [21], [32, 33],
[29, 30], [4, 3, 5], [1].

1.3 The procedures studied in this work 

If $\mathbb{E}(Y|X) = \langle X, A_0 \rangle$ where $A_0$ is low rank, in the sense
that $r \ll n$, nuclear norm penalization (4) is likely to enjoy good prediction
performance. But, if we know more about the properties of $A_0$, then other
penalization procedures can be considered. For instance, if we know that the non-zero
singular values of $A_0$ are "well-spread" (that is, almost equal) then it may be
interesting to use the "regularization effect" of an "$S_2$ norm" based penalty in the
same spirit as a "ridge type" penalty for vectors or functions. Moreover, if we know
that many entries of $A_0$ are close or equal to zero, then using also an
$\ell_1$-penalization

$$\|A\|_1 = \sum_{\substack{1 \le p \le m \\ 1 \le q \le T}} |A_{p,q}| \tag{7}$$

may improve the prediction even further. In this paper, we consider a penalization
that uses a mixture of several norms: for $\lambda_1, \lambda_2, \lambda_3 \ge 0$, we
consider

$$\operatorname{pen}_{\lambda_1,\lambda_2,\lambda_3}(A) = \lambda_1 \|A\|_{S_1} + \lambda_2 \|A\|_{S_2}^2 + \lambda_3 \|A\|_1 \tag{8}$$

and we will study the prediction properties of

$$\hat A_n(\lambda_1, \lambda_2, \lambda_3) \in \operatorname*{argmin}_{A \in \mathcal{M}_{m,T}} \big\{ R_n(A) + \operatorname{pen}_{\lambda_1,\lambda_2,\lambda_3}(A) \big\}. \tag{9}$$

Of course, if more is known on the structure of $A_0$, other penalty functions can be
considered.

We obtain sharp oracle inequalities for the procedure
$\hat A_n(\lambda_1, \lambda_2, \lambda_3)$ for any values of
$\lambda_1, \lambda_2, \lambda_3 \ge 0$ (except for
$(\lambda_1, \lambda_2, \lambda_3) = (0, 0, 0)$, which gives the well-studied empirical
risk minimization procedure). In particular, depending on the "a priori" knowledge
that we have on $A_0$, we will consider different values for the triple
$(\lambda_1, \lambda_2, \lambda_3)$. If $A_0$ is only known to be low-rank, one should
choose $\lambda_1 > 0$ and $\lambda_2 = \lambda_3 = 0$. If $A_0$ is known to be
low-rank with many zero entries, one should choose $\lambda_1, \lambda_3 > 0$ and
$\lambda_2 = 0$. If $A_0$ is known to be low-rank with well-spread non-zero singular
values, one should choose $\lambda_1, \lambda_2 > 0$ and $\lambda_3 = 0$. Finally, one
should choose $\lambda_1, \lambda_2, \lambda_3 > 0$ when a significant part of the
entries of $A_0$ are zero, $A_0$ is low rank and the non-zero singular values of $A_0$
are well-spread.

2 Results 

We will use the following notation: for a matrix $A \in \mathcal{M}_{m,T}$,
$\operatorname{vec}(A)$ denotes the vector of $\mathbb{R}^{mT}$ obtained by stacking
its columns into a single vector. Note that this is an isometry between
$(\mathcal{M}_{m,T}, \|\cdot\|_{S_2})$ and $(\mathbb{R}^{mT}, |\cdot|_{\ell_2^{mT}})$
since $\langle A, B \rangle = \langle \operatorname{vec} A, \operatorname{vec} B \rangle$.
We introduce also the $\ell_\infty$ norm $\|A\|_\infty = \max_{p,q} |A_{p,q}|$. Let us
recall that for $\alpha \ge 1$, the $\psi_\alpha$-norm of a random variable $Z$ is
given by
$\|Z\|_{\psi_\alpha} := \inf\{c > 0 : \mathbb{E}[\exp(|Z|^\alpha / c^\alpha)] \le 2\}$
and a similar norm can be defined for $0 < \alpha < 1$ (cf. [25]).

2.1 Assumptions and examples 

The first assumption concerns the "covariate" matrix $X$.

Assumption 1 (Matrix $X$). There are positive constants $b_{X,\infty}$,
$b_{X,\ell_\infty}$ and $b_{X,2}$ such that $\|X\|_{S_\infty} \le b_{X,\infty}$,
$\|X\|_\infty \le b_{X,\ell_\infty}$ and $\|X\|_{S_2} \le b_{X,2}$ almost surely.
Moreover, we assume that the "covariance matrix"

$$\Sigma := \mathbb{E}[\operatorname{vec} X (\operatorname{vec} X)^\top]$$

is invertible.

This assumption is met in the matrix completion and the multitask-learning
problems:

1. In the matrix completion problem, the matrix $X$ is uniformly distributed over
the set $\{e_{p,q} : 1 \le p \le m, 1 \le q \le T\}$ (see Section 1.2), so in this case
$\Sigma = (mT)^{-1} I_{mT}$ and $b_{X,2} = b_{X,\infty} = b_{X,\ell_\infty} = 1$.

2. In the multitask-learning problem, the matrix $X$ is uniformly distributed in
$\{A_j(x_{j,s}) : j = 1, \dots, T; s = 1, \dots, k_j\}$, where
$(x_{j,s} : j = 1, \dots, T; s = 1, \dots, k_j)$ is a family of vectors in
$\mathbb{R}^m$ and for any $j = 1, \dots, T$ and $x \in \mathbb{R}^m$,
$A_j(x) \in \mathcal{M}_{m,T}$ is the matrix having the vector $x$ for $j$-th column
and zero everywhere else. So, in this case $\Sigma$ is equal to $T^{-1}$ times the
$mT \times mT$ block matrix with $T$ diagonal blocks of size $m \times m$ made of the
$T$ matrices $k_j^{-1} \sum_{s=1}^{k_j} x_{j,s} x_{j,s}^\top$ for $j = 1, \dots, T$.
If we assume that the smallest singular values of the matrices
$k_j^{-1} \sum_{s=1}^{k_j} x_{j,s} x_{j,s}^\top \in \mathcal{M}_{m,m}$ for
$j = 1, \dots, T$ are larger than a constant $\sigma_{\min}$ (note that this implies
that $k_j \ge m$), then $\Sigma$ has its smallest singular value larger than
$\sigma_{\min} T^{-1}$, so it is invertible. Moreover, if the vectors $x_{j,s}$ are
normalized in $\ell_2$, then one can take
$b_{X,\infty} = b_{X,\ell_\infty} = b_{X,2} = 1$.
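The claim made in the first example, that $\Sigma$ is a multiple of the identity in the matrix completion problem, can be checked by direct enumeration. The sketch below uses exact rational arithmetic; the function names and the small dimensions $m = 2$, $T = 3$ are hypothetical choices for illustration.

```python
from fractions import Fraction


def vec(A):
    """Stack the columns of A into a single list (column-major order)."""
    m, T = len(A), len(A[0])
    return [A[i][j] for j in range(T) for i in range(m)]


def completion_covariance(m, T):
    """Sigma = E[vec(X) vec(X)^T] for X uniform over the mT matrices e_{p,q}."""
    d = m * T
    sigma = [[Fraction(0)] * d for _ in range(d)]
    for p in range(m):
        for q in range(T):
            X = [[1 if (i, j) == (p, q) else 0 for j in range(T)] for i in range(m)]
            v = vec(X)
            for a in range(d):
                for b in range(d):
                    # each e_{p,q} is drawn with probability 1/(mT)
                    sigma[a][b] += Fraction(v[a] * v[b], d)
    return sigma


sigma = completion_covariance(2, 3)
```

Each $\operatorname{vec}(e_{p,q})$ is a distinct canonical basis vector of $\mathbb{R}^{mT}$, so the average of the rank-one matrices is the identity divided by $mT$.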

The next assumption deals with the regression function of $Y$ given $X$. It is
standard in regression analysis.

Assumption 2 (Noise). There are positive constants $b_Y$, $b_{Y,\infty}$,
$b_{Y,\psi_2}$, $b_{Y,2}$ such that $\|Y - \mathbb{E}(Y|X)\|_{\psi_2} \le b_{Y,\psi_2}$,
$\|\mathbb{E}(Y|X)\|_{L_\infty} \le b_{Y,\infty}$,
$\mathbb{E}[(Y - \mathbb{E}(Y|X))^2 \,|\, X] \le b_{Y,2}^2$ almost surely and
$\mathbb{E} Y^2 \le b_Y^2$.

In particular, any model $Y = \langle A_0, X \rangle + \varepsilon$, where
$\|A_0\|_{S_\infty} < +\infty$ and $\varepsilon$ is a sub-gaussian noise, satisfies
Assumption 2. Note that by using the whole strength of Talagrand's concentration
inequality on product spaces for $\psi_\alpha$ ($0 < \alpha \le 1$) random variables
obtained in [2], other types of tail decay of the noise could be considered (yet
leading to slower decay of the residual term) depending on this assumption.
2.2 Main results

In this section we state our main results. We give sharp oracle inequalities for the
penalized empirical risk minimization procedure

$$\hat A_n \in \operatorname*{argmin}_{A \in \mathcal{M}_{m,T}} \Big\{ \frac{1}{n} \sum_{i=1}^n (Y_i - \langle X_i, A \rangle)^2 + \operatorname{pen}(A) \Big\}, \tag{10}$$

where $\operatorname{pen}(A)$ is a penalty function which will be either a pure
$\|\cdot\|_{S_1}$ penalization, or a "matrix elastic-net" penalization
$\|\cdot\|_{S_1} + \|\cdot\|_{S_2}^2$ or other penalty functions involving the
$\|\cdot\|_1$ norm.

Theorem 1 (Pure $\|\cdot\|_{S_1}$ penalization). There is an absolute constant
$c > 0$ such that the following holds. Let Assumptions 1 and 2 hold, and let $x > 0$
be some fixed confidence level. Consider any

$$\hat A_n \in \operatorname*{argmin}_{A \in \mathcal{M}_{m,T}} \{R_n(A) + \lambda_{n,x} \|A\|_{S_1}\},$$

for

$$\lambda_{n,x} = c_{X,Y} \frac{(x + \log n) \log n}{\sqrt{n}},$$

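where $c_{X,Y} := c(1 + b_{X,2}^2 + b_Y b_{X,\infty} + b_{Y,\psi_1}^2 + b_{Y,\infty}^2 + b_{Y,2}^2 + b_{X,\infty}^2)$.
Then one has, with a probability larger than $1 - 5e^{-x}$, that

$$R(\hat A_n) \le \inf_{A \in \mathcal{M}_{m,T}} \{R(A) + \lambda_{n,x}(1 + \|A\|_{S_1})\}.$$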

Note that the residue that we obtain is of the form $\|A_0\|_{S_1}/\sqrt{n}$. In
particular, this residual term is not deteriorated if $A_0$ is of full rank but close
to a low rank matrix. Classical residues involving the rank of $A_0$ are useless in
this situation. It is also still meaningful when the quantity $m + T$ becomes large
compared to $n$. This is not the case of the residue of the form $r(m + T)/n$ obtained
previously for the same procedure (for other risks and under other, stronger,
assumptions).
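To make the dimension-free nature of this residue concrete, the following sketch evaluates the residual term of Theorem 1 for several sample sizes. It assumes the form $\lambda_{n,x} = c_{X,Y}(x + \log n)\log n/\sqrt{n}$; the value `c_xy=1.0` is a placeholder (the theorem does not give a numerical constant), and the function names and the chosen values of $x$ and $\|A\|_{S_1}$ are hypothetical.

```python
import math


def lambda_theorem1(n, x, c_xy=1.0):
    """lambda_{n,x} = c_{X,Y} (x + log n) log n / sqrt(n); c_xy = 1.0 is a placeholder."""
    return c_xy * (x + math.log(n)) * math.log(n) / math.sqrt(n)


def residue_theorem1(n, x, nuclear_norm, c_xy=1.0):
    """Residual term lambda_{n,x} (1 + ||A||_{S_1}) of the oracle inequality."""
    return lambda_theorem1(n, x, c_xy) * (1.0 + nuclear_norm)


# The residue depends on n, x and the nuclear norm only: m and T never appear.
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, residue_theorem1(n, x=3.0, nuclear_norm=5.0))
```

Neither function takes $m$ or $T$ as an argument, which is exactly the point of the remark above: the dimensions enter only through $\|A\|_{S_1}$.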

We now state three sharp oracle inequalities for procedures of the form (10) where 
the penalty function is a mixture of norms. 

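Theorem 2 (Matrix Elastic-Net). There is an absolute constant $c > 0$ such that
the following holds. Let Assumptions 1 and 2 hold. Fix any $x > 0$, $r_1, r_2 > 0$, and
consider

$$\hat A_n \in \operatorname*{argmin}_{A \in \mathcal{M}_{m,T}} \big\{ R_n(A) + \lambda_{n,x} (r_1 \|A\|_{S_1} + r_2 \|A\|_{S_2}^2) \big\},$$

where

$$\lambda_{n,x} = c_{X,Y} \frac{\log n}{\sqrt{n}} \Big( \frac{1}{r_1} + \frac{(x + \log n) \log n}{r_2 \sqrt{n}} \Big),$$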
where $c_{X,Y} = c(1 + b_{X,2}^2 + b_{X,2} b_Y + b_{Y,\psi_1}^2 + b_{Y,\infty}^2 + b_{Y,2}^2)$.
Then one has, with a probability larger than $1 - 5e^{-x}$, that

$$R(\hat A_n) \le \inf_{A \in \mathcal{M}_{m,T}} \{R(A) + \lambda_{n,x}(1 + r_1 \|A\|_{S_1} + r_2 \|A\|_{S_2}^2)\}.$$

Theorem 3 ($\|\cdot\|_{S_1} + \|\cdot\|_1$ penalization). There is an absolute
constant $c > 0$ such that the following holds. Let Assumptions 1 and 2 hold. Fix any
$x, r_1, r_3 > 0$, and consider

$$\hat A_n \in \operatorname*{argmin}_{A \in \mathcal{M}_{m,T}} \big\{ R_n(A) + \lambda_{n,x} (r_1 \|A\|_{S_1} + r_3 \|A\|_1) \big\}$$

for

$$\lambda_{n,x} := c_{X,Y} \Big( \frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3} \Big) \frac{(x + \log n)(\log n)^{3/2}}{\sqrt{n}},$$



where $c_{X,Y} = c(1 + b_{X,2}^2 + b_{X,2} b_Y + b_{Y,\psi_1}^2 + b_{Y,\infty}^2 + b_{Y,2}^2 + b_{X,\infty}^2 + b_{X,\ell_\infty}^2)$.
Then one has, with a probability larger than $1 - 5e^{-x}$, that

$$R(\hat A_n) \le \inf_{A \in \mathcal{M}_{m,T}} \{R(A) + \lambda_{n,x}(1 + r_1 \|A\|_{S_1} + r_3 \|A\|_1)\}.$$

Theorem 4 ($\|\cdot\|_{S_1} + \|\cdot\|_{S_2}^2 + \|\cdot\|_1$ penalization). There is
an absolute constant $c > 0$ such that the following holds. Let Assumptions 1 and 2
hold. Fix any $x, r_1, r_2, r_3 > 0$, and consider

$$\hat A_n \in \operatorname*{argmin}_{A \in \mathcal{M}_{m,T}} \big\{ R_n(A) + \lambda_{n,x} (r_1 \|A\|_{S_1} + r_2 \|A\|_{S_2}^2 + r_3 \|A\|_1) \big\}$$

for

$$\lambda_{n,x} := c_{X,Y} \frac{(\log n)^{3/2}}{\sqrt{n}} \Big( \frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3} + \frac{x + \log n}{r_2 \sqrt{n}} \Big),$$



where $c_{X,Y} = c(1 + b_{X,2}^2 + b_{X,2} b_Y + b_{Y,\psi_1}^2 + b_{Y,\infty}^2 + b_{Y,2}^2)$.
Then one has, with a probability larger than $1 - 5e^{-x}$, that

$$R(\hat A_n) \le \inf_{A \in \mathcal{M}_{m,T}} \{R(A) + \lambda_{n,x}(1 + r_1 \|A\|_{S_1} + r_2 \|A\|_{S_2}^2 + r_3 \|A\|_1)\}.$$

The parameters $r_1$, $r_2$ and $r_3$ in the above procedures are completely free and
can depend on $n$, $m$ and $T$. Intuitively, it is clear that $r_2$ should be smaller
than $r_1$ since the $\|\cdot\|_{S_2}$ term is used for "regularization" of the
non-zero singular values only, while the term $\|\cdot\|_{S_1}$ makes $\hat A_n$ of low
rank, as for the elastic-net for vectors (see [43]). Indeed, for the
$\|\cdot\|_{S_1} + \|\cdot\|_{S_2}^2$ penalization, any choice of $r_1$ and $r_2$ such
that $r_2 = r_1 \log n / \sqrt{n}$ leads to a residual term smaller than

$$c_{X,Y} (1 + x + \log n) \Big( \frac{(\log n)^2}{r_2 n} + \frac{\log n}{\sqrt{n}} \|A\|_{S_1} + \frac{(\log n)^2}{n} \|A\|_{S_2}^2 \Big).$$
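The arithmetic behind this choice of $r_2$ can be verified numerically. The sketch below assumes the elastic-net regularization parameter has the form $\lambda_{n,x} = c_{X,Y}\frac{\log n}{\sqrt n}\big(\frac{1}{r_1} + \frac{(x+\log n)\log n}{r_2\sqrt n}\big)$, takes the placeholder $c_{X,Y} = 1$, and uses hypothetical values of $n$, $x$ and $r_1$; it then computes the three coefficients of the residue $\lambda_{n,x}(1 + r_1\|A\|_{S_1} + r_2\|A\|_{S_2}^2)$.

```python
import math


def lambda_theorem2(n, x, r1, r2, c_xy=1.0):
    """lambda_{n,x} = c (log n / sqrt n) * (1/r1 + (x + log n) log n / (r2 sqrt n))."""
    log_n, sqrt_n = math.log(n), math.sqrt(n)
    return c_xy * (log_n / sqrt_n) * (1.0 / r1 + (x + log_n) * log_n / (r2 * sqrt_n))


n, x, r1 = 10_000, 2.0, 1.0               # hypothetical values
r2 = r1 * math.log(n) / math.sqrt(n)      # the choice r2 = r1 log n / sqrt n
lam = lambda_theorem2(n, x, r1, r2)

common = 1.0 + x + math.log(n)            # the factor (1 + x + log n), with c_{X,Y} = 1
coef_const = lam                          # multiplies 1
coef_s1 = lam * r1                        # multiplies ||A||_{S_1}
coef_s2 = lam * r2                        # multiplies ||A||_{S_2}^2
```

With this choice, $1/r_1 = \log n/(r_2\sqrt n)$, so the two terms in the bracket share a common factor and the three coefficients come out as $(1 + x + \log n)$ times $(\log n)^2/(r_2 n)$, $\log n/\sqrt n$ and $(\log n)^2/n$ respectively.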
Note that the rate related to $\|A\|_{S_1}$ is (up to logarithms) $1/\sqrt{n}$ while
the rate related to $\|A\|_{S_2}^2$ is $1/n$. The choice of $r_3$ depends on the number
of zeros in the matrix. Note that in the $\|\cdot\|_{S_1} + \|\cdot\|_1$ case, any
choice $1 \le r_3 \le r_1$ entails a residue smaller than

$$c_{X,Y} \frac{(x + \log n)(\log n)^{3/2}}{\sqrt{n}} (1 + \|A\|_{S_1} + \|A\|_1),$$

which makes again the residue independent of $m$ and $T$.

Note that, in the matrix completion case, the term $\sqrt{\log(mT)}$ can be removed
from the regularization (and thus the residual) term thanks to the second statement
of Proposition 1 below.

3 Proof of the main results 

3.1 Some definitions 

For any $r, r_1, r_2, r_3 \ge 0$, we consider the ball

$$B_{r,r_1,r_2,r_3} := \{A \in \mathcal{M}_{m,T} : r_1 \|A\|_{S_1} + r_2 \|A\|_{S_2}^2 + r_3 \|A\|_1 \le r\}, \tag{11}$$

and we denote by $B_{r,1} = B_{r,1,0,0}$ the nuclear norm ball, by
$B_{r,r_1,r_2} = B_{r,r_1,r_2,0}$ the elastic-net ball. In what follows, $B_r$ will be
either $B_{r,1}$, $B_{r,r_1,r_2}$, $B_{r,r_1,r_2,r_3}$ or $B_{r,r_1,0,r_3}$, depending on
the penalization. We consider an oracle matrix in $B_r$ given by:

$$A_r^* \in \operatorname*{argmin}_{A \in B_r} \mathbb{E}(Y - \langle X, A \rangle)^2$$

and the following excess loss function over $B_r$ defined for any $A \in B_r$ by

$$\mathcal{L}_{r,A}(X, Y) := (Y - \langle X, A \rangle)^2 - (Y - \langle X, A_r^* \rangle)^2.$$

Define also the excess loss functions class

$$\mathcal{L}_r := \{\mathcal{L}_{r,A} : A \in B_r\}. \tag{12}$$

The star-shaped hull at $0$ of $\mathcal{L}_r$ is given by

$$V_r := \operatorname{star}(\mathcal{L}_r, 0) = \{\alpha \mathcal{L}_{r,A} : A \in B_r \text{ and } 0 \le \alpha \le 1\}$$

and its localized set at level $\lambda > 0$ by

$$V_{r,\lambda} := \{g \in V_r : \mathbb{E} g \le \lambda\}. \tag{13}$$

The proofs of Theorems 1 to 4 rely on the isomorphic penalization method, introduced
by P. Bartlett, S. Mendelson and J. Neeman (cf. [8], [26] and [7]). It has improved
several results on penalized empirical risk minimization procedures for the Lasso
(cf. [7]) and for regularization in reproducing kernel Hilbert spaces (cf. [26]). This
approach relies on a sharp analysis of the complexity of the set $V_{r,\lambda}$.
Indeed, an important quantity appearing in learning theory is the maximal deviation of
the empirical distribution around its mean uniformly over a class of functions. If $V$
is a class of functions, we define the supremum of the deviation of the empirical mean
around its actual mean over $V$ by

$$\|P_n - P\|_V := \sup_{h \in V} \Big| \frac{1}{n} \sum_{i=1}^n h(X_i, Y_i) - \mathbb{E} h(X, Y) \Big|.$$



3.2 On the importance of convexity 

An important parameter driving the quality of concentration of $\|P_n - P\|_V$ to its
expectation is the so-called Bernstein's parameter (cf. [9]). We study this parameter
in our context without introducing a formal definition of this quantity.

For every matrix $A \in \mathcal{M}_{m,T}$, we consider the random variable
$f_A := \langle X, A \rangle$ and the following subset of $L_2$:

$$\mathcal{C}_r := \{f_A : A \in B_r\}, \tag{14}$$

where $B_r = B_{r,r_1,r_2,r_3}$ is given by (11). Because of the convexity of the norms
$\|\cdot\|_{S_1}$, $\|\cdot\|_{S_2}$ and $\|\cdot\|_1$, the set $\mathcal{C}_r$ is
convex, for any $r, r_1, r_2, r_3 \ge 0$. Now, consider the following minimum

$$f_r^* \in \operatorname*{argmin}_{f \in \mathcal{C}_r} \|Y - f\|_{L_2} \tag{15}$$

and

$$C_r := \min\Big( b_{X,\infty} \frac{r}{r_1},\; b_{X,2} \sqrt{\frac{r}{r_2}},\; b_{X,\ell_\infty} \frac{r}{r_3} \Big), \tag{16}$$

with the convention $1/0 = +\infty$.
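The constant $C_r$ and its convention $1/0 = +\infty$ translate directly into code. This is a sketch assuming $C_r = \min(b_{X,\infty}\, r/r_1,\; b_{X,2}\sqrt{r/r_2},\; b_{X,\ell_\infty}\, r/r_3)$; the function name, the argument names and the default values of the three bounds (all set to 1, as in the matrix completion example) are hypothetical.

```python
import math


def uniform_bound(r, r1, r2, r3, b_x_inf=1.0, b_x_2=1.0, b_x_linf=1.0):
    """C_r = min(b_{X,inf} r/r1, b_{X,2} sqrt(r/r2), b_{X,l_inf} r/r3), with 1/0 = +inf."""

    def ratio(num, den):
        # convention 1/0 = +infinity: a norm that is not penalized gives no bound
        return math.inf if den == 0 else num / den

    return min(
        b_x_inf * ratio(r, r1),
        b_x_2 * math.sqrt(ratio(r, r2)),
        b_x_linf * ratio(r, r3),
    )
```

For instance, with a pure nuclear-norm ball ($r_2 = r_3 = 0$) only the first term is finite, while adding an $\ell_1$ constraint with a large $r_3$ can make the third term the smallest.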

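Lemma 5 (Bernstein's parameter). Let Assumptions 1 and 2 hold. There is a unique
$f_r^*$ satisfying (15) and a unique $A_r^* \in B_r$ such that $f_r^* = f_{A_r^*}$.
Moreover, any $A \in B_r$ satisfies

$$\mathbb{E} \mathcal{L}_{r,A} \ge \mathbb{E} \langle X, A - A_r^* \rangle^2,$$

and the class $\mathcal{L}_r$ satisfies the following Bernstein's condition: for all
$A \in B_r$,

$$\mathbb{E} \mathcal{L}_{r,A}^2 \le 4 \big( b_{Y,2}^2 + (b_{Y,\infty} + C_r)^2 \big) \mathbb{E} \mathcal{L}_{r,A}.$$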

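Proof. By convexity of $\mathcal{C}_r$ we have
$\langle Y - f_r^*, f - f_r^* \rangle_{L_2} \le 0$ for any $f \in \mathcal{C}_r$. Thus,
we have, for any $f \in \mathcal{C}_r$,

$$\|Y - f\|_{L_2}^2 - \|Y - f_r^*\|_{L_2}^2 = 2 \langle f_r^* - f, Y - f_r^* \rangle_{L_2} + \|f - f_r^*\|_{L_2}^2 \ge \|f - f_r^*\|_{L_2}^2. \tag{17}$$

In particular, the minimum is unique. Moreover, $\mathcal{C}_r$ is a closed set and
since $\Sigma$ is invertible under Assumption 1, there is a unique $A_r^* \in B_r$ such
that $f_r^* = f_{A_r^*}$. By the trace duality formula and Assumption 1, we have, for
any $A \in B_{r,r_1,r_2,r_3}$:

$$|f_A| \le \|X\|_{S_\infty} \|A\|_{S_1} \le b_{X,\infty} \frac{r}{r_1}, \qquad |f_A| \le \|X\|_{S_2} \|A\|_{S_2} \le b_{X,2} \sqrt{\frac{r}{r_2}}, \qquad \text{and} \qquad |f_A| \le \|X\|_\infty \|A\|_1 \le b_{X,\ell_\infty} \frac{r}{r_3}$$

almost surely, so that $|f_A| \le C_r$ for any $A \in B_r$ a.s. Moreover, for any
$A \in B_r$:

$$\mathcal{L}_{r,A} = 2 (Y - \mathbb{E}(Y|X)) \langle X, A_r^* - A \rangle + (2 \mathbb{E}(Y|X) - \langle A + A_r^*, X \rangle) \langle X, A_r^* - A \rangle. \tag{18}$$

Thus, using Assumption 2, we obtain

$$\begin{aligned}
\mathbb{E} \mathcal{L}_{r,A}^2 &= \mathbb{E}\big[ 4 (Y - \mathbb{E}(Y|X))^2 \langle X, A - A_r^* \rangle^2 + (2 \mathbb{E}(Y|X) - \langle X, A + A_r^* \rangle)^2 \langle X, A - A_r^* \rangle^2 \big] \\
&\le 4 \mathbb{E}\big[ \langle X, A - A_r^* \rangle^2 \, \mathbb{E}[(Y - \mathbb{E}(Y|X))^2 \,|\, X] \big] + 4 (b_{Y,\infty} + C_r)^2 \mathbb{E} \langle X, A - A_r^* \rangle^2 \\
&\le 4 \big( b_{Y,2}^2 + (b_{Y,\infty} + C_r)^2 \big) \mathbb{E} \langle X, A - A_r^* \rangle^2,
\end{aligned}$$

which concludes the proof using (17). □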
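The equality in the first line of the last display uses that the cross term of decomposition (18) has zero mean. Spelled out, with the shorthand $\xi$, $a$ and $b$ introduced here only for this remark (they are not notation from the paper):

```latex
\begin{align*}
\xi &:= Y - \mathbb{E}(Y|X), \qquad
a := 2 \langle X, A_r^* - A \rangle, \qquad
b := \big( 2\mathbb{E}(Y|X) - \langle A + A_r^*, X \rangle \big) \langle X, A_r^* - A \rangle, \\
\mathcal{L}_{r,A} &= \xi a + b, \qquad \text{where $a$ and $b$ are functions of $X$ only}, \\
\mathbb{E}[\xi a b] &= \mathbb{E}\big[ a b \, \mathbb{E}(\xi \,|\, X) \big] = 0, \\
\mathbb{E} \mathcal{L}_{r,A}^2 &= \mathbb{E}[\xi^2 a^2] + 2 \mathbb{E}[\xi a b] + \mathbb{E}[b^2]
  = \mathbb{E}[\xi^2 a^2] + \mathbb{E}[b^2].
\end{align*}
```

The two remaining terms are then bounded separately, the first through the conditional variance bound $b_{Y,2}^2$ and the second through $|2\mathbb{E}(Y|X) - \langle A + A_r^*, X \rangle| \le 2(b_{Y,\infty} + C_r)$.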

3.3 The isomorphic property of the excess loss functions class 

The isomorphic property of a functions class has been introduced in [26] and is a 
consequence of Talagrand's concentration inequality (cf. [40]) applied to a localization 
of the functions class together with the Bernstein property of this class (here this 
property was studied in Lemma 5). We recall here the argument in our special case. 

Theorem 6 ([10]). There exists an absolute constant $c > 0$ such that the following
holds. Let Assumptions 1 and 2 hold. Let $r > 0$ and $\lambda(r) > 0$ be such that

$$\mathbb{E} \|P_n - P\|_{V_{r,\lambda(r)}} \le \frac{\lambda(r)}{8}.$$

Then, with probability larger than $1 - 4e^{-x}$: for all $A \in B_r$,

$$\frac{1}{2} P_n \mathcal{L}_{r,A} - \rho_n(r, x) \le P \mathcal{L}_{r,A} \le 2 P_n \mathcal{L}_{r,A} + \rho_n(r, x),$$

where

$$\rho_n(r, x) := c \Big[ \lambda(r) + \big( b_{Y,\psi_1} + b_{Y,\infty} + b_{Y,2} + C_r \big)^2 \frac{x \log n}{n} \Big]$$

and $C_r$ has been introduced in (16).

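Proof. We follow the line of [10]. Let $\lambda > 0$ and $x > 0$. Thanks to [2], with
probability larger than $1 - 4\exp(-x)$,

$$\|P - P_n\|_{V_{r,\lambda}} \le 2 \mathbb{E} \|P - P_n\|_{V_{r,\lambda}} + c_1 \sigma(V_{r,\lambda}) \sqrt{\frac{x}{n}} + c_2 b_n(V_{r,\lambda}) \frac{x}{n} \tag{19}$$

where, by using the Bernstein's properties of $\mathcal{L}_r$ (cf. Lemma 5),

$$\begin{aligned}
\sigma^2(V_{r,\lambda}) := \sup_{g \in V_{r,\lambda}} \operatorname{Var}(g)
&\le \sup\big\{ \mathbb{E}(\alpha \mathcal{L}_{r,A})^2 : 0 \le \alpha \le 1, A \in B_r, \mathbb{E}(\alpha \mathcal{L}_{r,A}) \le \lambda \big\} \\
&\le \sup\big\{ 4 (b_{Y,2}^2 + (b_{Y,\infty} + C_r)^2) \mathbb{E}(\alpha \mathcal{L}_{r,A}) : 0 \le \alpha \le 1, A \in B_r, \mathbb{E}(\alpha \mathcal{L}_{r,A}) \le \lambda \big\} \\
&\le 4 (b_{Y,2}^2 + (b_{Y,\infty} + C_r)^2) \lambda,
\end{aligned} \tag{20}$$

and using Pisier's inequality (cf. [42]):

$$\begin{aligned}
b_n(V_{r,\lambda}) := \Big\| \max_{1 \le i \le n} \sup_{g \in V_{r,\lambda}} g(X_i, Y_i) \Big\|_{\psi_1}
&\le (\log n) \Big\| \sup_{g \in V_{r,\lambda}} g(X, Y) \Big\|_{\psi_1} \\
&\le (\log n) \Big\| \sup\big\{ \alpha (2Y - \langle X, A + A_r^* \rangle) \langle X, A_r^* - A \rangle : 0 \le \alpha \le 1, A \in B_r \big\} \Big\|_{\psi_1} \\
&\le 4 (\log n) (b_{Y,\psi_1} + b_{Y,\infty} + C_r) C_r,
\end{aligned} \tag{21}$$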
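where we used decomposition (18) and Assumption 2 together with the uniform bound
$|\langle A, X \rangle| \le C_r$ holding for all $A \in B_r$.

Moreover, for any $\lambda > 0$, $V_{r,\lambda}$ is star-shaped so
$G : \lambda \mapsto \mathbb{E} \|P - P_n\|_{V_{r,\lambda}} / \lambda$ is
non-increasing. Since $G(\lambda(r)) \le 1/8$ and $\rho_n(r, x) \ge \lambda(r)$, we have

$$\mathbb{E} \|P - P_n\|_{V_{r,\rho_n(r,x)}} \le \rho_n(r, x)/8,$$

which yields, in Equation (19) together with the variance control of Equation (20)
and the control of Equation (21), that there exists an event $\Omega_0$ of probability
measure greater than $1 - 4\exp(-x)$ such that, on $\Omega_0$,

$$\begin{aligned}
\|P - P_n\|_{V_{r,\rho_n(r,x)}} &\le \frac{\rho_n(r, x)}{4} + c_1 (b_{Y,\infty} + b_{Y,2} + C_r) \sqrt{\frac{\rho_n(r, x)\, x}{n}} + c_2 (b_{Y,\psi_1} + b_{Y,\infty} + C_r) C_r \frac{x \log n}{n} \\
&\le \frac{\rho_n(r, x)}{2}
\end{aligned} \tag{22}$$

in view of the definition of $\rho_n(r, x)$. In particular, on $\Omega_0$, for every
$A \in B_r$ such that $P \mathcal{L}_{r,A} \le \rho_n(r, x)$, we have
$|P \mathcal{L}_{r,A} - P_n \mathcal{L}_{r,A}| \le \rho_n(r, x)/2$. Now, take
$A \in B_r$ such that $P \mathcal{L}_{r,A} = \beta > \rho_n(r, x)$ and set
$g = \rho_n(r, x) \mathcal{L}_{r,A} / \beta$. Since $g \in V_{r,\rho_n(r,x)}$,
Equation (22) yields, on $\Omega_0$, $|Pg - P_n g| \le \rho_n(r, x)/2$, that is
$|P \mathcal{L}_{r,A} - P_n \mathcal{L}_{r,A}| \le \beta/2 = P \mathcal{L}_{r,A}/2$, and
so $(1/2) P_n \mathcal{L}_{r,A} \le P \mathcal{L}_{r,A} \le 2 P_n \mathcal{L}_{r,A}$,
which concludes the proof. □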



A function $r \mapsto \lambda(r)$ such that
$\mathbb{E} \|P_n - P\|_{V_{r,\lambda(r)}} \le \lambda(r)/8$ is called an isomorphic
function and is directly connected to the choice of the penalization used in the
procedure which was introduced in Section 2. The computation of this function is
related to the complexity of Schatten balls, computed in the next section.

3.4 Complexity of Schatten balls

The generic chaining technique (see [41]) is a powerful technique for the control of
the supremum of empirical processes. For a subgaussian process, such a control is
achieved using the $\gamma_2$ functional recalled in the next definition.

Definition 7 ([41]). Let $(F, d)$ be a metric space. We say that $(F_j)_{j \ge 0}$ is an
admissible sequence of partitions of $F$ if $|F_0| = 1$ and $|F_j| \le 2^{2^j}$ for all
$j \ge 1$. The $\gamma_2$ functional is defined by

$$\gamma_2(F, d) = \inf_{(F_j)_j} \sup_{f \in F} \sum_{j \ge 0} 2^{j/2} d(f, F_j),$$

where the infimum is taken over all admissible sequences $(F_j)_{j \ge 0}$ of $F$.

A classical upper bound on the $\gamma_2$ functional is Dudley's entropy integral:

$$\gamma_2(B, \|\cdot\|) \le c_0 \int_0^\infty \sqrt{\log N(B, \|\cdot\|, \epsilon)} \, d\epsilon, \tag{23}$$

where $N(B, \|\cdot\|, \epsilon)$ is the minimal number of balls of radius $\epsilon$,
with respect to the metric induced by $\|\cdot\|$, needed to cover $B$. When $B$
enjoys some convexity properties, this bound can be improved. Let $(E, \|\cdot\|)$ be
a Banach space. We denote by $B(E)$ its unit ball. We say that $(E, \|\cdot\|)$ is
2-convex if there exists some $\rho > 0$ such that for all $x, y \in B(E)$, we have

$$\|x + y\| \le 2 - 2\rho \|x - y\|^2.$$

In the case of 2-convex bodies, the following theorem gives an upper bound on the
$\gamma_2$ functional that can improve the one given by Dudley's entropy integral.

Theorem 8 ([41]). For any $\rho > 0$, there exists $c(\rho) > 0$ such that if
$(E, \|\cdot\|)$ is a 2-convex Banach space and $\|\cdot\|_E$ is another norm on $E$,
then

$$\gamma_2(B(E), \|\cdot\|_E) \le c(\rho) \Big( \int_0^\infty \epsilon \log N(B(E), \|\cdot\|_E, \epsilon) \, d\epsilon \Big)^{1/2}.$$

The generic chaining technique provides the following upper bound on subgaussian
processes.

Theorem 9 ([41]). There is an absolute constant $c > 0$ such that the following
holds. If $(Z_f)_{f \in F}$ is a subgaussian process for some metric $d$ (i.e.
$\|Z_f - Z_g\|_{\psi_2} \le c_0 d(f, g)$ for all $f, g \in F$) and if $f_0 \in F$, then
one has

$$\mathbb{E} \sup_{f \in F} |Z_f - Z_{f_0}| \le c \gamma_2(F, d).$$
The metric used to measure the complexity of the excess loss classes we are working
on is an empirical one defined for any $A \in \mathcal{M}_{m,T}$ by

$$\|A\|_{\infty,n} := \max_{1 \le i \le n} |\langle X_i, A \rangle|. \tag{24}$$

This metric comes out of the so-called $L_{\infty,n}$-method of M. Rudelson introduced
in [35] and first used in learning theory in [26]. We denote by $B(S_p)$ the unit ball
of the Banach space $S_p$ of matrices in $\mathcal{M}_{m,T}$ endowed with the Schatten
norm $\|\cdot\|_{S_p}$. We denote also by $B_1$ the unit ball of $\mathcal{M}_{m,T}$
endowed with the $\ell_1$-norm $\|\cdot\|_1$. In the following, we compute the
complexity of the balls $B(S_1)$, $B(S_2)$ and $B_1$ with respect to the empirical
metric $\|\cdot\|_{\infty,n}$.

Proposition 1. There exists an absolute constant $c > 0$ such that the following
holds. Assume that $\|X_i\|_{S_2}, \|X_i\|_\infty \le 1$ for all $i = 1, \dots, n$.
Then, we have

$$\gamma_2(r B(S_1), \|\cdot\|_{\infty,n}) \le \gamma_2(r B(S_2), \|\cdot\|_{\infty,n}) \le c\, r \log n$$

and

$$\gamma_2(r B_1, \|\cdot\|_{\infty,n}) \le c\, r (\log n)^{3/2} \sqrt{\log(mT)}.$$

Moreover, if we assume that $X_1, \dots, X_n$ have been obtained in the matrix
completion model then

$$\gamma_2(r B_1, \|\cdot\|_{\infty,n}) \le c\, r (\log n)^{3/2}.$$

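Proof. The first inequality is obvious since $B(S_1) \subset B(S_2)$. By using Dual
Sudakov's inequality (cf. [31]), we have for all $\epsilon > 0$,

$$\log N(B(S_2), \|\cdot\|_{\infty,n}, \epsilon) \le c \Big( \frac{\mathbb{E} \|G\|_{\infty,n}}{\epsilon} \Big)^2,$$

where $G$ is an $m \times T$ matrix with i.i.d. standard Gaussian random variables for
entries. A Gaussian maximal inequality and the fact that $\|X_i\|_{S_2} \le 1$ for all
$i = 1, \dots, n$ provide $\mathbb{E} \|G\|_{\infty,n} \le c_1 \sqrt{\log n}$, hence

$$\log N(B(S_2), \|\cdot\|_{\infty,n}, \epsilon) \le \frac{c_2 \log n}{\epsilon^2}.$$

Denote by $B_{\infty,n}$ the unit ball of $(\mathcal{M}_{m,T}, \|\cdot\|_{\infty,n})$ in
$V_n = \operatorname{span}(X_1, \dots, X_n)$, the linear subspace of
$\mathcal{M}_{m,T}$ spanned by $X_1, \dots, X_n$. The volumetric argument provides

$$\begin{aligned}
\log N(B(S_2), \|\cdot\|_{\infty,n}, \epsilon) &\le \log N(B(S_2), \|\cdot\|_{\infty,n}, \eta) + \log N(\eta B_{\infty,n}, \epsilon B_{\infty,n}) \\
&\le \frac{c_2 \log n}{\eta^2} + n \log\Big( \frac{3\eta}{\epsilon} \Big)
\end{aligned}$$

for any $\eta \ge \epsilon > 0$. Thus, for $\eta_n = \sqrt{\log n / n}$, we have, for
all $0 < \epsilon \le \eta_n$,

$$\log N(B(S_2), \|\cdot\|_{\infty,n}, \epsilon) \le c_3 n \log\Big( \frac{3\eta_n}{\epsilon} \Big).$$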
Since $B(S_2)$ is the unit ball of a Hilbert space, it is 2-convex. We can thus apply
Theorem 8 to obtain the following upper bound

$$\gamma_2(r B(S_2), \|\cdot\|_{\infty,n}) \le c_4 r \log n.$$

Now, we prove an upper bound on the complexity of $B_1$ with respect to
$\|\cdot\|_{\infty,n}$. Recall that
$\operatorname{vec} : \mathcal{M}_{m,T} \to \mathbb{R}^{mT}$ concatenates the columns of
a matrix into a single vector of size $mT$. Obviously, $\operatorname{vec}$ is an
isometry between $(\mathcal{M}_{m,T}, \|\cdot\|_{S_2})$ and
$(\mathbb{R}^{mT}, |\cdot|_2)$, since
$\langle A, B \rangle = \langle \operatorname{vec}(A), \operatorname{vec}(B) \rangle$.
Using this mapping, we see that, for any $\epsilon > 0$,

$$N(B_1, \|\cdot\|_{\infty,n}, \epsilon) = N(b_1^{mT}, |\cdot|_{\infty,n}, \epsilon)$$

where $b_1^{mT}$ is the unit ball of $\ell_1^{mT}$ and $|\cdot|_{\infty,n}$ is the
pseudo norm on $\mathbb{R}^{mT}$ defined for any $x \in \mathbb{R}^{mT}$ by
$|x|_{\infty,n} = \max_{1 \le i \le n} |\langle y_i, x \rangle|$ where
$y_i = \operatorname{vec}(X_i)$ for $i = 1, \dots, n$. Note that
$y_1, \dots, y_n \in b_2^{mT}$, where $b_2^{mT}$ is the unit ball of $\ell_2^{mT}$. We
use the Carl-Maurey empirical method to compute the covering number
$N(b_1^{mT}, |\cdot|_{\infty,n}, \epsilon)$ for "large scales" of $\epsilon$ and the
volumetric argument for "small scales". Let us begin with the Carl-Maurey argument.
Let $x \in b_1^{mT}$ and $Z$ be a random variable with values in
$\{\pm e_1, \dots, \pm e_{mT}, 0\}$, where $(e_1, \dots, e_{mT})$ is the canonical basis
of $\mathbb{R}^{mT}$, defined by $\mathbb{P}[Z = 0] = 1 - |x|_1$ and for all
$i = 1, \dots, mT$,

$$\mathbb{P}[Z = \operatorname{sign}(x_i) e_i] = |x_i|.$$

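Note that $\mathbb{E}Z = x$. Let $s \in \mathbb{N} \setminus \{0\}$ to be defined later and take $s$ i.i.d. copies of
$Z$ denoted by $Z_1, \ldots, Z_s$. By the Giné-Zinn symmetrization argument and the fact
that Rademacher processes are upper bounded by Gaussian processes, we have
\[
\mathbb{E}\Big|\frac{1}{s}\sum_{i=1}^s Z_i - \mathbb{E}Z\Big|_{\infty,n}
\le c_0\, \mathbb{E}\Big|\frac{1}{s}\sum_{i=1}^s g_i Z_i\Big|_{\infty,n}
\le c_1 \sqrt{\frac{\log n}{s}}, \qquad (25)
\]
where $g_1, \ldots, g_s$ are i.i.d. standard Gaussian variables independent of the $Z_i$, and
where the last inequality follows by a Gaussian maximal inequality and the fact that
$|y_i|_2 \le 1$. Take $s \in \mathbb{N}$ to be the smallest integer such that $\epsilon \ge c_1\sqrt{(\log n)/s}$. Then,
the set
\[
\Big\{ \frac{1}{s}\sum_{i=1}^s z_i : z_1, \ldots, z_s \in \{\pm e_1, \ldots, \pm e_{mT}, 0\} \Big\} \qquad (26)
\]
is an $\epsilon$-net of $b_1^{mT}$ with respect to $|\cdot|_{\infty,n}$. Indeed, thanks to (25) there exists $\omega \in \Omega$
such that $|s^{-1}\sum_{i=1}^s Z_i(\omega) - x|_{\infty,n} \le \epsilon$. This implies that there exists an element in
the set (26) which is $\epsilon$-close to $x$. Since the cardinality of the set introduced in (26)
is, according to [17], at most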

\[
\binom{2mT + s - 1}{s} \le \Big(\frac{e(2mT + s - 1)}{s}\Big)^s,
\]
we obtain for any $\epsilon \ge \eta_n := \big((\log n)(\log mT)/n\big)^{1/2}$ that
\[
\log N\big(b_1^{mT}, |\cdot|_{\infty,n}, \epsilon\big) \le s \log\Big(\frac{e(2mT + s - 1)}{s}\Big) \le \frac{c_2 (\log n)\log(mT)}{\epsilon^2},
\]
and a volumetric argument gives
\[
\log N\big(b_1^{mT}, |\cdot|_{\infty,n}, \epsilon\big) \le c_3\, n \log\Big(\frac{3\eta_n}{\epsilon}\Big)
\]
for any $0 < \epsilon < \eta_n$. Now we use the upper bound (23) and compute Dudley's
entropy integral to obtain
\[
\gamma_2\big(rB_1, \|\cdot\|_{\infty,n}\big) \le c_4\, r (\log n)^{3/2} \sqrt{\log(mT)}.
\]
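
The Carl-Maurey empirical method used above can be illustrated numerically. The sketch below draws $s$ i.i.d. copies of $Z$ for a point $x$ of the unit $\ell_1$ ball, averages them, and measures the error in the pseudo-norm $|\cdot|_{\infty,n}$. The sizes, the Gaussian inputs $X_i$ and the helper names `maurey_average` and `pseudo_norm` are illustrative assumptions, not part of the paper; the comparison with $\sqrt{(\log n)/s}$ ignores the absolute constant $c_1$.

```python
import numpy as np

def maurey_average(x, s, rng):
    """Average of s i.i.d. draws of Z, where P[Z = sign(x_k) e_k] = |x_k| and P[Z = 0] = 1 - |x|_1."""
    d = x.size
    weights = np.abs(x)
    # Index d encodes the outcome Z = 0.
    probs = np.append(weights, max(0.0, 1.0 - weights.sum()))
    probs /= probs.sum()  # guard against floating-point drift
    draws = rng.choice(d + 1, size=s, p=probs)
    hits = draws[draws < d]
    avg = np.zeros(d)
    np.add.at(avg, hits, np.sign(x[hits]))  # accumulates correctly over repeated indices
    return avg / s

def pseudo_norm(v, Y):
    """|v|_{inf,n} = max_i |<y_i, v>|, where the rows of Y are y_i = vec(X_i)."""
    return float(np.max(np.abs(Y @ v)))

rng = np.random.default_rng(0)
m, T, n, s = 20, 30, 50, 400  # hypothetical sizes

# Hypothetical inputs X_i, normalized so that each y_i lies in the unit l2 ball.
X = rng.standard_normal((n, m, T))
X /= np.linalg.norm(X, axis=(1, 2), keepdims=True)
Y = X.transpose(0, 2, 1).reshape(n, m * T)  # column-stacking vec of each X_i

x = rng.laplace(size=m * T)
x /= np.abs(x).sum()  # a point with |x|_1 = 1

approx = maurey_average(x, s, rng)
err = pseudo_norm(approx - x, Y)
print(err, np.sqrt(np.log(n) / s))
```

The average `approx` is an element of the finite set (26), so a small value of `err` exhibits an explicit net point close to `x`.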

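For the "matrix completion case", we have
\[
N\big(b_1^{mT}, |\cdot|_{\infty,n}, \epsilon\big) \le N\big(b_1^n, \epsilon b_\infty^n\big)
\]
where $N(b_1^n, \epsilon b_\infty^n)$ is the minimal number of balls $\epsilon b_\infty^n$ needed to cover $b_1^n$. We use the
following proposition from [37] to compute $N(b_1^n, \epsilon b_\infty^n)$.

Proposition 2 ([37]). For any $\epsilon > 0$, we have
\[
\log N\big(b_1^n, \epsilon b_\infty^n\big) \sim
\begin{cases}
0 & \text{if } \epsilon \ge 1, \\
\epsilon^{-1} \log(e n \epsilon) & \text{if } n^{-1} \le \epsilon \le 1, \\
n \log\big(1/(\epsilon n)\big) & \text{if } 0 < \epsilon \le n^{-1}.
\end{cases}
\]

Then the result follows from (23) and the computation of Dudley's entropy
integral using Proposition 2. □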

3.5 Computation of the isomorphic function 

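Introduce the ellipsoid
\[
D := \{A \in \mathcal{M}_{m,T} : \mathbb{E}\langle X, A\rangle^2 \le 1\}.
\]
A consequence of Equation (17) in Lemma 5 is the following inclusion, of importance
in what follows. Indeed, since $B_r$ is convex and symmetric, one has:
\[
\{A \in B_r : \mathbb{E}\mathcal{L}_{r,A} \le \lambda\} \subset A_r^* + K_{r,\lambda}, \qquad (27)
\]
where
\[
K_{r,\lambda} := 2B_r \cap \sqrt{\lambda} D.
\]
Hence, the complexity of $\{A \in \mathcal{M}_{m,T} : \mathcal{L}_{r,A} \in \mathcal{L}_{r,\lambda}\}$ will be smaller than the
complexity of $B_r$ and of $\sqrt{\lambda} D$. This will be of importance in the analysis below. The next
result provides an upper bound on the complexity of $V_{r,\lambda}$ where we recall that
\[
V_{r,\lambda} := \{\alpha \mathcal{L}_{r,A} : 0 \le \alpha \le 1,\ A \in B_r,\ \mathbb{E}(\alpha\mathcal{L}_{r,A}) \le \lambda\}.
\]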

From this statement we will derive corollaries that provide the shape of the considered
penalty functions.

Proposition 3. There exist two absolute constants $c_1$ and $c_2$ such that the following
holds. Let Assumptions 1 and 2 hold. For any $r > 0$ and $\lambda > 0$, we have
\[
\mathbb{E}\|P - P_n\|_{V_{r,\lambda}} \le c_1 \sum_{i \ge 0} 2^{-i} \phi_n(r, 2^{i+1}\lambda),
\]
where
\[
\phi_n(r, \lambda) := c_2\Big(U_n(K_{r,\lambda})\sqrt{\frac{\lambda}{n}} + U_n(K_{r,\lambda})\sqrt{\frac{R(A_r^*)}{n}} + \frac{U_n(K_{r,\lambda})^2}{n}\Big)
\]
for $K_{r,\lambda} = 2B_r \cap \sqrt{\lambda} D$.

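Proof. Introduce $\mathcal{L}_{r,\lambda} = \{\mathcal{L}_{r,A} : A \in B_r,\ \mathbb{E}\mathcal{L}_{r,A} \le \lambda\}$. Using the Giné-Zinn
symmetrization [20] and the inclusion (27), one has, for any $r > 0$ and $\lambda > 0$,
\[
\mathbb{E}\|P - P_n\|_{\mathcal{L}_{r,\lambda}} \le \frac{2}{n}\, \mathbb{E}\,\mathbb{E}_\epsilon \sup_{A \in A_r^* + K_{r,\lambda}} \Big|\sum_{i=1}^n \epsilon_i \mathcal{L}_{r,A}(X_i, Y_i)\Big|
\]
where $\epsilon_1, \ldots, \epsilon_n$ are $n$ i.i.d. Rademacher variables. Introduce the Rademacher process
$Z_A := \sum_{i=1}^n \epsilon_i \mathcal{L}_{r,A}(X_i, Y_i)$, and note that for any $A, A' \in A_r^* + K_{r,\lambda}$:
\[
\begin{aligned}
\mathbb{E}_\epsilon|Z_A - Z_{A'}|^2
&= \sum_{i=1}^n \langle X_i, A - A'\rangle^2 \big(2Y_i - \langle X_i, A + A'\rangle\big)^2 \\
&= 4\sum_{i=1}^n \langle X_i, A - A'\rangle^2 \Big(Y_i - \langle X_i, A_r^*\rangle - \Big\langle X_i, \frac{A + A'}{2} - A_r^*\Big\rangle\Big)^2 \\
&\le 8\|A - A'\|_{n,\infty}^2 \Big(\sum_{i=1}^n (Y_i - \langle X_i, A_r^*\rangle)^2 + \sup_{A \in K_{r,\lambda}} \sum_{i=1}^n \langle X_i, A\rangle^2\Big),
\end{aligned}
\]
where we recall that $\|A\|_{n,\infty} = \max_{i=1,\ldots,n} |\langle X_i, A\rangle|$. So, using the generic chaining
mechanism (cf. Theorem 9), we obtain
\[
\begin{aligned}
\mathbb{E}\|P - P_n\|_{\mathcal{L}_{r,\lambda}}
&\le \frac{c}{n}\, \mathbb{E}\Big[\gamma_2\big(K_{r,\lambda}, \|\cdot\|_{n,\infty}\big)\Big(\sum_{i=1}^n (Y_i - \langle X_i, A_r^*\rangle)^2 + \sup_{A \in K_{r,\lambda}} \sum_{i=1}^n \langle X_i, A\rangle^2\Big)^{1/2}\Big] \\
&\le \frac{c}{\sqrt n}\big(\mathbb{E}\,\gamma_2(K_{r,\lambda}, \|\cdot\|_{n,\infty})^2\big)^{1/2} \Big(R(A_r^*) + \mathbb{E}\sup_{A \in K_{r,\lambda}} \frac{1}{n}\sum_{i=1}^n \langle X_i, A\rangle^2\Big)^{1/2}.
\end{aligned}
\]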

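Now, introduce, for some set $K \subset \mathcal{M}_{m,T}$ the functional
\[
U_n(K) := \big(\mathbb{E}\,\gamma_2(K, \|\cdot\|_{n,\infty})^2\big)^{1/2}.
\]
Using Theorem 1.2 from [22], we obtain:
\[
\mathbb{E}\sup_{A \in K_{r,\lambda}} \frac{1}{n}\sum_{i=1}^n \langle X_i, A\rangle^2 \le \lambda + c\max\Big(\frac{\sqrt{\lambda}\, U_n(K_{r,\lambda})}{\sqrt n}, \frac{U_n(K_{r,\lambda})^2}{n}\Big),
\]
and so, we arrive at
\[
\mathbb{E}\|P - P_n\|_{\mathcal{L}_{r,\lambda}} \le c\, \frac{U_n(K_{r,\lambda})}{\sqrt n}\Big(\lambda + R(A_r^*) + \frac{\sqrt{\lambda}\, U_n(K_{r,\lambda})}{\sqrt n} + \frac{U_n(K_{r,\lambda})^2}{n}\Big)^{1/2} \le c\,\phi_n(r, \lambda),
\]
where $\phi_n$ is the function introduced in the statement of Proposition 3 and where the last
inequality uses $\sqrt{a + b} \le \sqrt a + \sqrt b$ and $2\sqrt{ab} \le a + b$.

We conclude with the peeling argument provided in Lemma 4.6 of [26]:
\[
\mathbb{E}\|P - P_n\|_{V_{r,\lambda}} \le c\sum_{i \ge 0} 2^{-i}\, \mathbb{E}\|P - P_n\|_{\mathcal{L}_{r,2^{i+1}\lambda}}. \qquad \square
\]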
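
The peeling bound is useful because a sum of the form $\sum_{i \ge 0} 2^{-i}\phi(2^{i+1}\lambda)$ stays within a constant factor of $\phi(\lambda)$ as soon as $\phi$ grows at most like $\sqrt{\lambda}$. The following sketch checks this numerically for a function with the three-term shape of Proposition 3. It is only an illustration under simplifying assumptions: $U_n(K_{r,\lambda})$ is replaced by a fixed number `u`, all absolute constants are set to 1, and the numerical values are hypothetical.

```python
import math

def phi(lam, u, risk, n):
    """Three-term shape u*sqrt(lam/n) + u*sqrt(risk/n) + u^2/n (constants set to 1)."""
    return u * math.sqrt(lam / n) + u * math.sqrt(risk / n) + u * u / n

def peeling_sum(lam, u, risk, n, terms=60):
    """Truncation of sum_{i >= 0} 2^{-i} * phi(2^{i+1} * lam)."""
    return sum(2.0 ** (-i) * phi(2.0 ** (i + 1) * lam, u, risk, n) for i in range(terms))

# Hypothetical values, chosen only for illustration.
n, u, risk = 1000, 5.0, 1.0
ratios = [peeling_sum(lam, u, risk, n) / phi(lam, u, risk, n)
          for lam in (1e-3, 1e-1, 1.0, 10.0)]

# The sqrt(lam) term contributes sqrt(2) * sum_i 2^{-i/2} = 2 * (sqrt(2) + 1),
# the lam-free terms contribute a factor sum_i 2^{-i} = 2.
bound = 2.0 * (math.sqrt(2.0) + 1.0)
print(ratios, bound)
```

Every ratio lies between 1 and $2(\sqrt 2 + 1)$, whatever the value of $\lambda$, which is the sense in which the sum is comparable to a single term.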

Now, we can derive the following corollaries. They give upper bounds for
$\mathbb{E}\|P - P_n\|_{V_{r,\lambda}}$, depending on what $B_r$ is (i.e. which penalty function is used).

Corollary 1 ($\|\cdot\|_{S_1}$ penalization). Let Assumptions 1 and 2 hold and assume that
$B_r = B_{r,1,0,0}$ for $r > 0$, see (11). Then, we have
\[
\mathbb{E}\|P - P_n\|_{V_{r,\lambda_1(r)}} \le \frac{\lambda_1(r)}{8}
\]
for any $r > 0$, where
\[
\lambda_1(r) = c\Big(\frac{b_{X,2}^2 r^2 (\log n)^2}{n} + \frac{b_{X,2} b_Y r \log n}{\sqrt n}\Big).
\]



Proof. If $B_r = rB(S_1)$, we have using the embedding $K_{r,\lambda} \subset 2B_r$ and Proposition 1
that $U_n(K_{r,\lambda}) \le c\, b_{X,2}\, r \log n$, so
\[
\phi_n(r, \lambda) \le c\Big(b_{X,2}\, r \log n \sqrt{\frac{\lambda}{n}} + b_{X,2}\, r \log n \sqrt{\frac{R(A_r^*)}{n}} + \frac{b_{X,2}^2 r^2 (\log n)^2}{n}\Big) =: c\,\phi_{n,1}(r, \lambda).
\]
Hence, using Proposition 3 we obtain
\[
\mathbb{E}\|P - P_n\|_{V_{r,\lambda}} \le c\sum_{i \ge 0} 2^{-i} \phi_{n,1}(r, 2^{i+1}\lambda) \le c\,\phi_{n,1}(r, \lambda),
\]
where we used the fact that the sum is comparable to its first term because of the
exponential decay of the summands. Thus, one has $\mathbb{E}\|P - P_n\|_{V_{r,\lambda}} \le \lambda/8$ when
$\lambda \ge c\,\phi_{n,1}(r, \lambda)$. In particular, since $R(A_r^*) \le \mathbb{E}Y^2 \le b_Y^2$ (see Assumption 2), for
values of $\lambda$ such that

\[
\lambda \ge c\Big(\frac{b_{X,2}^2 r^2 (\log n)^2}{n} + \frac{b_{X,2} b_Y r \log n}{\sqrt n}\Big)
\]
we have $\mathbb{E}\|P - P_n\|_{V_{r,\lambda}} \le \lambda/8$, which proves the Corollary. □

Corollary 2 ($\|\cdot\|_{S_1} + \|\cdot\|_1$ penalization). Let Assumptions 1 and 2 hold and assume
that $B_r = B_{r,r_1,0,r_3}$ for $r, r_1, r_3 > 0$, see (11). Then, we have
\[
\mathbb{E}\|P - P_n\|_{V_{r,\lambda_{r_1,0,r_3}(r)}} \le \frac{\lambda_{r_1,0,r_3}(r)}{8}
\]
for any $r > 0$, where
\[
\lambda_{r_1,0,r_3}(r) = c\Big[\Big(\frac{1}{r_1^2} \wedge \frac{\log(mT)}{r_3^2}\Big)\frac{b_{X,2}^2 r^2 (\log n)^2}{n} + \Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3}\Big)\frac{b_{X,2} b_Y r (\log n)^{3/2}}{\sqrt n}\Big].
\]

Proof. The proof follows the same steps as the proof of Corollary 1. □



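Corollary 3 ($\|\cdot\|_{S_1} + \|\cdot\|_{S_2}^2$ penalization). Let Assumptions 1 and 2 hold and assume
that $B_r = B_{r,r_1,r_2,0}$ for $r, r_1, r_2 > 0$, see (11). Then, we have
\[
\mathbb{E}\|P - P_n\|_{V_{r,\lambda_{r_1,r_2}(r)}} \le \frac{\lambda_{r_1,r_2}(r)}{8}
\]
for any $r > 0$, where
\[
\lambda_{r_1,r_2}(r) = c\Big(\frac{b_{X,2}^2 r (\log n)^2}{r_2 n} + \frac{b_{X,2} b_Y r \log n}{r_1 \sqrt n}\Big).
\]

Proof. Use the inclusion
\[
B_r \subset \sqrt{\frac{r}{r_2}}\, B(S_2) \cap \frac{r}{r_1}\, B(S_1)
\]
to obtain using Proposition 1 that
\[
\phi_n(r, \lambda) \le c\Big(b_{X,2}\sqrt{\frac{r}{r_2}} \log n \sqrt{\frac{\lambda}{n}} + b_{X,2}\frac{r}{r_1} \log n \sqrt{\frac{R(A_r^*)}{n}} + \frac{b_{X,2}^2 r (\log n)^2}{r_2 n}\Big).
\]
The remainder of the proof is the same as that of Corollary 1, so it is omitted. □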

Corollary 4 ($\|\cdot\|_{S_1} + \|\cdot\|_{S_2}^2 + \|\cdot\|_1$ penalization). Let Assumptions 1 and 2 hold
and assume that $B_r = B_{r,r_1,r_2,r_3}$ for $r, r_1, r_2, r_3 > 0$, see (11). Then, we have
\[
\mathbb{E}\|P - P_n\|_{V_{r,\lambda_{r_1,r_2,r_3}(r)}} \le \frac{\lambda_{r_1,r_2,r_3}(r)}{8}
\]
for any $r > 0$, where
\[
\lambda_{r_1,r_2,r_3}(r) = c\Big(\frac{b_{X,2}^2 r (\log n)^2}{r_2 n} + \Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3}\Big)\frac{b_{X,2} b_Y r (\log n)^{3/2}}{\sqrt n}\Big).
\]

Proof. The proof follows the same steps as the proof of Corollary 3. □

The main difference between $\lambda_1(r)$, $\lambda_{r_1,0,r_3}(r)$ and $\lambda_{r_1,r_2}(r)$, $\lambda_{r_1,r_2,r_3}(r)$ is that
$\lambda_{r_1,r_2}(r)$ and $\lambda_{r_1,r_2,r_3}(r)$ are linear in $r$ while $\lambda_1(r)$ and $\lambda_{r_1,0,r_3}(r)$ are quadratic. The
analysis of the isomorphic functions with quadratic terms will require an extra argument
in the proof, in order to remove them from the penalty (see below).

Remark 1 (Localization does not work here). Note that, in Corollaries 1 to 2, we
don't use the fact that $K_{r,\lambda} \subset \sqrt{\lambda} D$, that is, we don't use the localization argument
which usually allows one to derive fast rates in statistical learning theory. Indeed, for
the matrix completion problem, one has $\mathbb{E}\langle X, A - A^*\rangle^2 = \frac{1}{mT}\|A - A^*\|_{S_2}^2$, so when
$\mathbb{E}\langle X, A - A^*\rangle^2 \le \lambda$, we only know that $A \in A^* + \sqrt{mT\lambda}\, B(S_2)$, leading to a term
of order $mT/n$ (up to logarithms) in the isomorphic function. This term is way too
large, since one has typically in matrix completion problems that $mT \gg n$.
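
The identity behind Remark 1 can be checked by direct enumeration. The sketch below assumes that $X$ is uniformly distributed over the $mT$ canonical basis matrices $e_p e_q^\top$, which is the sampling scheme for which the identity holds; the function name `second_moment` and the matrix sizes are illustrative choices.

```python
import numpy as np

def second_moment(A):
    """E<X, A>^2 for X uniform on the canonical basis matrices e_p e_q^T (assumed sampling scheme)."""
    m, T = A.shape
    total = 0.0
    for p in range(m):
        for q in range(T):
            X = np.zeros((m, T))
            X[p, q] = 1.0
            total += float(np.sum(X * A)) ** 2  # <X, A> = A[p, q]
    return total / (m * T)

rng = np.random.default_rng(1)
m, T = 6, 9  # hypothetical sizes
A = rng.standard_normal((m, T))
A_star = rng.standard_normal((m, T))

lhs = second_moment(A - A_star)
rhs = np.linalg.norm(A - A_star, "fro") ** 2 / (m * T)

# Knowing only E<X, A - A*>^2 <= lam gives a Frobenius radius of sqrt(m * T * lam).
lam = lhs
radius = np.sqrt(m * T * lam)
print(lhs, rhs, radius)
```

The factor $mT$ in `radius` is the one that makes the localized ball too large when $mT \gg n$.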

3.6 Isomorphic penalization method 

We introduce the isomorphic penalization method developed by P. Bartlett, S. Mendelson
and J. Neeman in the following general setup. Let $(\mathcal{Z}, \sigma_{\mathcal{Z}}, \nu)$ be a measurable
space endowed with the probability measure $\nu$. We consider $Z, Z_1, \ldots, Z_n$ i.i.d.
random variables having $\nu$ for common probability distribution. We are given a class
$\mathcal{F}$ of functions on a measurable space $(\mathcal{X}, \sigma_{\mathcal{X}})$, a loss function and a risk function
\[
Q : \mathcal{Z} \times \mathcal{F} \to \mathbb{R}; \qquad R(f) = \mathbb{E}Q(Z, f).
\]
For the problem we have in mind, we will use $Q((X, Y), A) = (Y - \langle X, A\rangle)^2$ for every
$A \in \mathcal{M}_{m,T}$.

Now, we go into the core of the isomorphic penalization method. We are given a
model $F \subset \mathcal{F}$ and a family $\{F_r : r \ge 0\}$ of subsets of $F$. We consider the following
definition.

Definition 10 (cf. [26]). Let $\rho_n$ be a non-negative function defined on $\mathbb{R}_+ \times \mathbb{R}_+$
(which may depend on the sample). We say that the family $\{F_r : r \ge 0\}$ of subsets of
$F$ is an ordered, parameterized hierarchy of $F$ with isomorphic function $\rho_n$ when the
following conditions are satisfied:

1. $\{F_r : r \ge 0\}$ is non-decreasing (that is $s \le t \Rightarrow F_s \subset F_t$);

2. for any $r \ge 0$, there exists a unique element $f_r^* \in F_r$ such that $R(f_r^*) =
\inf(R(f) : f \in F_r)$; we consider the excess loss function associated with the
class $F_r$
\[
\mathcal{L}_{r,f}(\cdot) = Q(\cdot, f) - Q(\cdot, f_r^*); \qquad (28)
\]

3. the map $r \mapsto R(f_r^*)$ is continuous;

4. for every $r_0 \ge 0$, $\cap_{r > r_0} F_r = F_{r_0}$;

5. $\cup_{r \ge 0} F_r = F$;

6. for every $r \ge 0$ and $u > 0$, with probability at least $1 - \exp(-u)$
\[
(1/2) P_n \mathcal{L}_{r,f} - \rho_n(r, u) \le P\mathcal{L}_{r,f} \le 2 P_n \mathcal{L}_{r,f} + \rho_n(r, u), \qquad (29)
\]
for any $f \in F_r$, where $P_n \mathcal{L}_{r,f} = (1/n)\sum_{i=1}^n \mathcal{L}_{r,f}(Z_i)$.

In the context of learning theory, an ordered, parameterized hierarchy of a set $F$
with isomorphic function $\rho_n$ provides a very general framework for the construction
of penalized empirical risk minimization procedures. The following result from [26]
proves that the isomorphic function is a "correct penalty function".

Theorem 11 ([26]). There exist absolute positive constants $c_1$ and $c_2$ such that the
following holds. Let $\{F_r : r \ge 0\}$ be an ordered, parameterized hierarchy of $F$ with
isomorphic function $\rho_n$. Let $u > 0$. With probability at least $1 - \exp(-u)$ any penalized
empirical risk minimization procedure
\[
\hat f \in \operatorname{argmin}_{f \in F} \Big(R_n(f) + c_1 \rho_n\big(2(r(f) + 1), \theta(r(f) + 1, u)\big)\Big), \qquad (30)
\]
where $r(f) = \inf(r \ge 0 : f \in F_r)$ and $R_n(f) = (1/n)\sum_{i=1}^n Q(Z_i, f)$ is the empirical
risk of $f$, satisfies
\[
R(\hat f) \le \inf_{f \in F} \Big(R(f) + c_2 \rho_n\big(2(r(f) + 1), \theta(r(f) + 1, u)\big)\Big)
\]
where for all $r \ge 1$ and $x > 0$,

\[
\theta(r, x) = x + \ln(\pi^2/6) + 2\ln\Big(1 + \frac{\mathbb{E}Y^2}{\rho_n(0, x + \log(\pi^2/6))} + \log r\Big).
\]

3.7 End of the proof of Theorems 1 and 2 

First, we need to prove that the family of models $\{B_r : r \ge 0\}$ is an ordered,
parameterized hierarchy of $\mathcal{M}_{m,T}$. The first, fourth and fifth points of Definition 10 are
easy to check. The second point follows from Lemma 5. For the third point, we consider
$0 < q < r < s$, $\beta := q/r$ and $\alpha := r/s$. Since $\alpha A_s^* \in B_r$, we have
\[
0 \le R(A_r^*) - R(A_s^*) \le R(\alpha A_s^*) - R(A_s^*) \le (\alpha^2 - 1)\|\langle X, A_s^*\rangle\|_{L_2}^2 + 2(1 - \alpha)\|Y\|_{L_2}\|\langle X, A_s^*\rangle\|_{L_2}.
\]
As $s \to r$, the right hand side tends to zero (because the $\langle X, A_s^*\rangle$ are uniformly bounded
in $L_2$ for $s \in [r, r + 1]$). So $r \mapsto R(A_r^*)$ is upper semi-continuous on $(0, \infty)$. The
continuity in $r = 0$ follows the same line. In the other direction,
\[
0 \le R(A_q^*) - R(A_r^*) \le R(\beta A_r^*) - R(A_r^*) \le (\beta^2 - 1)\|\langle X, A_r^*\rangle\|_{L_2}^2 + 2(1 - \beta)\|Y\|_{L_2}\|\langle X, A_r^*\rangle\|_{L_2}
\]
and the right hand side tends to zero as $q \to r$ for the same reason as before.

Now, we turn to the sixth point of Definition 10, that is, the computation of the
isomorphic function $\rho_n$ associated with the family $\{B_r : r \ge 0\}$. Using Theorem 6 we
obtain that, with a probability larger than $1 - 4e^{-x}$:
\[
\frac{1}{2} P_n \mathcal{L}_{r,A} - \rho_n(r, x) \le P\mathcal{L}_{r,A} \le 2 P_n \mathcal{L}_{r,A} + \rho_n(r, x) \qquad \forall A \in B_r,
\]
where
\[
\rho_n(r, x) := c\Big(\lambda(r) + (b'_Y + C_r)^2\, \frac{x \log n}{n}\Big),
\]
where $b'_Y := b_{Y,\psi_1} + b_{Y,\infty} + b_{Y,2}$, and where $C_r$ and $\lambda(r)$ are defined depending on the
considered penalization (see (16) and Corollaries 1 to 4). Now, we apply Theorem 11
to the hierarchy $F_r = B_r$ for $r \ge 0$. First of all, note that, for every $x > 0$ and $r \ge 1$



\[
\theta(r, x) = x + \ln(\pi^2/6) + 2\ln\Big(1 + \frac{\mathbb{E}Y^2}{\rho_n(0, x + \log(\pi^2/6))} + \log r\Big) \le x + c(\log n + \log\log r),
\]
so $\rho_n(2(r + 1), \theta(r + 1, x)) \le \rho'_n(r, x)$, with:
\[
\rho'_n(r, x) := c\Big(\lambda(2(r + 1)) + (b'_Y + C_r)^2\, \frac{(x + \log n + \log\log r)\log n}{n}\Big).
\]



From now on, the analysis depends on the penalization, so we consider them sepa- 
rately. 

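3.7.1 The $\|\cdot\|_{S_1}$ case

Recall that in this case
\[
\lambda(r) = c\Big(\frac{b_{X,2}^2 r^2 (\log n)^2}{n} + \frac{b_{X,2} b_Y r \log n}{\sqrt n}\Big)
\]
and $C_r = b_{X,\infty}\, r$, see (16). An easy computation gives $\rho'_n(r, x) \le \tilde\rho_{n,1}(r, x)$ where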

\[
\tilde\rho_{n,1}(r, x) := \Big(c_{X,Y}\, \frac{(r + 1)^2 (x + \log n \vee \log\log r)\log n}{n}\Big) \vee \rho_{n,1}(r, x),
\]
where $c_{X,Y} := c(1 + b_{X,2}^2 + b_{X,2} b_Y + b_{Y,\psi_1}^2 + b_{Y,\infty}^2 + b_{Y,2}^2 + b_{X,\infty}^2)$ and where
\[
\rho_{n,1}(r, x) := c_{X,Y}\, \frac{(r + 1)(x + \log n)\log n}{\sqrt n}.
\]
Note that $\rho_{n,1}(r, x)$ is the penalty we want (the one considered in Theorem 1). Let
us introduce for short $r(A) = \|A\|_{S_1}$ and the following functionals:
\[
\begin{aligned}
\Lambda_1(A) &= R(A) + \mathrm{pen}_1(A), & \Lambda_{n,1}(A) &= R_n(A) + \mathrm{pen}_1(A), \\
\tilde\Lambda_1(A) &= R(A) + \widetilde{\mathrm{pen}}_1(A), & \tilde\Lambda_{n,1}(A) &= R_n(A) + \widetilde{\mathrm{pen}}_1(A),
\end{aligned}
\]
where $\mathrm{pen}_1(A) := \rho_{n,1}(r(A), x)$ and where $\widetilde{\mathrm{pen}}_1(A) := \tilde\rho_{n,1}(r(A), x)$ is a penalization
that satisfies that, if $\tilde A \in \operatorname{argmin}_A \tilde\Lambda_{n,1}(A)$, then we have $R(\tilde A) \le \inf_A \tilde\Lambda_1(A)$
with a probability larger than $1 - 4e^{-x}$. Recall that we want to prove that if
$\hat A \in \operatorname{argmin}_A \Lambda_{n,1}(A)$, then we have $R(\hat A) \le \inf_A \Lambda_1(A)$ with a probability larger
than $1 - 5e^{-x}$. This will follow if we prove
\[
\inf_A \tilde\Lambda_1(A) \le \inf_A \Lambda_1(A) \quad \text{and} \qquad (31)
\]
\[
\operatorname{argmin}_A \Lambda_{n,1}(A) \subset \operatorname{argmin}_A \tilde\Lambda_{n,1}(A), \qquad (32)
\]
so we focus on the proof of these two facts. First of all, let us prove that if $\tilde\rho_{n,1}(r, x) >
\rho_{n,1}(r, x)$ then both $r$ and $\rho_{n,1}(r, x)$ cannot be small.

If $\log n < \log\log r$ we have $r > e^n$ and $\rho_{n,1}(r, x) \ge c_{X,Y}\, e^n (\log n)^2/\sqrt n$. If $\log n \ge
\log\log r$ and $\tilde\rho_{n,1}(r, x) > \rho_{n,1}(r, x)$, then
\[
\frac{(r + 1)^2 (x + \log n)\log n}{n} > \frac{(r + 1)(x + \log n)\log n}{\sqrt n}
\]
so $r > \sqrt n - 1$ and $\rho_{n,1}(r, x) \ge c_{X,Y}(\log n)^2$. Hence, we proved that if $\tilde\rho_{n,1}(r, x) >
\rho_{n,1}(r, x)$, then $r \ge 1$ and $\rho_{n,1}(r, x) \ge c_{X,Y}(\log n)^2$. Note also that $\rho_{n,1}(r, x) \ge
2c_{X,Y}(x + \log n)\log n/\sqrt n$ since $r \ge 1$.

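Let us turn to the proof of (31). Let $A'$ be such that $\tilde\Lambda_1(A') > \Lambda_1(A')$. Then
$\widetilde{\mathrm{pen}}_1(A') > \mathrm{pen}_1(A')$, i.e. $\tilde\rho_{n,1}(r(A'), x) > \rho_{n,1}(r(A'), x)$, so that $r(A') \ge 1$, $\rho_{n,1}(r(A'), x) \ge
c_{X,Y}(\log n)^2$ and $\rho_{n,1}(r(A'), x) \ge 2c_{X,Y}(x + \log n)\log n/\sqrt n$. On the other hand, we
have $\inf_A \tilde\Lambda_1(A) \le b_Y^2 + \widetilde{\mathrm{pen}}_1(0) = b_Y^2 + \rho_{n,1}(0, x)$. But $\rho_{n,1}(r(A'), x) \ge c_{X,Y}(\log n)^2 \ge
2b_Y^2$ and $\rho_{n,1}(r(A'), x) \ge 2\rho_{n,1}(0, x)$ since $r(A') \ge 1$, so that $b_Y^2 + \rho_{n,1}(0, x) \le
\rho_{n,1}(r(A'), x)$ and then
\[
\inf_A \tilde\Lambda_1(A) \le \rho_{n,1}(r(A'), x) \le \Lambda_1(A').
\]
Hence, we proved that if $A'$ is such that $\Lambda_1(A') < \inf_A \tilde\Lambda_1(A)$, we have $\tilde\Lambda_1(A') \le
\Lambda_1(A')$, so $\inf_A \tilde\Lambda_1(A) \le \tilde\Lambda_1(A') \le \Lambda_1(A') < \inf_A \tilde\Lambda_1(A)$, a contradiction. No such
$A'$ exists, which proves (31).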

The proof of (32) is almost the same. Let $A'$ be such that $\tilde\Lambda_{n,1}(A') > \Lambda_{n,1}(A')$,
so as before we have $r(A') \ge 1$, $\rho_{n,1}(r(A'), x) \ge c_{X,Y}(\log n)^2$ and $\rho_{n,1}(r(A'), x) \ge
2c_{X,Y}(x + \log n)\log n/\sqrt n$. This time we have $\inf_A \tilde\Lambda_{n,1}(A) \le n^{-1}\sum_{i=1}^n Y_i^2 + \rho_{n,1}(0, x)$,
so we use some concentration for the sum of the $Y_i^2$'s. Indeed, we have, as a
consequence of [2], that

- Y if < EY 2 + c lA /E(y4)- + c 2 logn l|r (33) 
n f— ' V n n 

with a probability larger than $1 - e^{-x}$. But then, it is easy to infer that for $n$ large
enough, the right hand side of (33) is smaller than $\rho_{n,1}(r(A'), x)/2$, so that we have,
on an event of probability larger than $1 - e^{-x}$, that
\[
\inf_A \tilde\Lambda_{n,1}(A) \le \frac{1}{n}\sum_{i=1}^n Y_i^2 + \rho_{n,1}(0, x) \le \rho_{n,1}(r(A'), x) \le \Lambda_{n,1}(A').
\]
So, we proved that if $\Lambda_{n,1}(A') < \tilde\Lambda_{n,1}(A')$, then $A' \notin \operatorname{argmin}_A \tilde\Lambda_{n,1}(A)$, or equivalently
that $\operatorname{argmin}_A \tilde\Lambda_{n,1}(A) \subset \{A : \tilde\Lambda_{n,1}(A) \le \Lambda_{n,1}(A)\}$. But $\Lambda_{n,1}(A) \le \tilde\Lambda_{n,1}(A)$ for any $A$
(since $\rho_{n,1}(r, x) \le \tilde\rho_{n,1}(r, x)$), so (32) follows. This concludes the proof of Theorem 1.
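
To make the objects of this subsection concrete, the sketch below evaluates the empirical functional $\Lambda_{n,1}(A) = R_n(A) + \mathrm{pen}_1(A)$ with $r(A) = \|A\|_{S_1}$ computed from the singular values. It only evaluates the criterion and does not minimize it. The constant `c_xy` is a placeholder for $c_{X,Y}$, whose value is not specified numerically in the text, and the simulated data and function names are hypothetical.

```python
import numpy as np

def nuclear_norm(A):
    """||A||_{S_1}: sum of the singular values of A."""
    return float(np.linalg.svd(A, compute_uv=False).sum())

def penalty_1(A, n, x, c_xy=1.0):
    """rho_{n,1}(r(A), x) = c_xy * (r(A) + 1) * (x + log n) * log n / sqrt(n), with r(A) = ||A||_{S_1}."""
    r = nuclear_norm(A)
    return c_xy * (r + 1.0) * (x + np.log(n)) * np.log(n) / np.sqrt(n)

def empirical_risk(A, X, Y):
    """R_n(A) = (1/n) * sum_i (Y_i - <X_i, A>)^2."""
    preds = np.einsum("imt,mt->i", X, A)
    return float(np.mean((Y - preds) ** 2))

def penalized_objective(A, X, Y, x, c_xy=1.0):
    """Lambda_{n,1}(A) = R_n(A) + pen_1(A)."""
    return empirical_risk(A, X, Y) + penalty_1(A, len(Y), x, c_xy)

# Hypothetical simulated data: a rank-one target observed through Gaussian inputs.
rng = np.random.default_rng(2)
n, m, T, x = 200, 5, 4, 1.0
X = rng.standard_normal((n, m, T))
A_true = np.outer(rng.standard_normal(m), rng.standard_normal(T))
Y = np.einsum("imt,mt->i", X, A_true) + 0.1 * rng.standard_normal(n)

print(penalized_objective(np.zeros((m, T)), X, Y, x))
print(penalized_objective(A_true, X, Y, x))
```

At $A = 0$ the criterion reduces to $n^{-1}\sum_i Y_i^2 + \rho_{n,1}(0, x)$, the quantity bounded in the proof of (32).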

3.7.2 The $\|\cdot\|_{S_1} + \|\cdot\|_1$ case

Recall that in this case
\[
\lambda(r) = c\Big[\Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3}\Big)^2 \frac{b_{X,2}^2 r^2 (\log n)^2}{n} + \Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3}\Big)\frac{b_{X,2} b_Y r (\log n)^{3/2}}{\sqrt n}\Big]
\]
and that
\[
C_r = \min\Big(b_{X,\infty}\frac{r}{r_1},\ b_{X,\ell_\infty}\frac{r}{r_3}\Big),
\]
see (16). An easy computation gives that $\rho'_n(r, x) \le \tilde\rho_{n,2}(r, x)$, where
\[
\tilde\rho_{n,2}(r, x) := \Big(c_{X,Y}\Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3}\Big)^2 \frac{(r + 1)^2 (x + \log n \vee \log\log r)\log n}{n}\Big) \vee \rho_{n,2}(r, x),
\]
where $c_{X,Y} = c(1 + b_{X,2}^2 + b_{X,2} b_Y + b_{Y,\psi_1}^2 + b_{Y,\infty}^2 + b_{Y,2}^2 + b_{X,\infty}^2 + b_{X,\ell_\infty}^2)$ and
\[
\rho_{n,2}(r, x) := c_{X,Y}\Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3}\Big)\frac{(r + 1)(x + \log n)(\log n)^{3/2}}{\sqrt n}.
\]
Note that $\rho_{n,2}(r, x)$ is the penalization we want (the one considered in Theorem 3).
Introducing $r(A) = r_1\|A\|_{S_1} + r_3\|A\|_1$, the remainder of the proof follows the lines of
the pure $\|\cdot\|_{S_1}$ case, so it is omitted.

3.7.3 The $\|\cdot\|_{S_1} + \|\cdot\|_{S_2}^2$ case

This is easier than what we did for the $\|\cdot\|_{S_1}$ case, since we only have a $\log\log r$ term
to remove from the penalization. Recall that
\[
\lambda(r) = c\Big(\frac{b_{X,2}^2 r (\log n)^2}{r_2 n} + \frac{b_{X,2} b_Y r \log n}{r_1 \sqrt n}\Big)
\]
and
\[
C_r = \min\Big(b_{X,\infty}\frac{r}{r_1},\ b_{X,2}\sqrt{\frac{r}{r_2}}\Big) \le b_{X,2}\sqrt{\frac{r}{r_2}},
\]
so that $\rho'_n(r, x) \le \tilde\rho_{n,3}(r, x)$ where
\[
\tilde\rho_{n,3}(r, x) = c_{X,Y}\, \frac{(r + 1)\log n}{\sqrt n}\Big(\frac{1}{r_1} + \frac{(x + \log n \vee \log\log r)\log n}{r_2 \sqrt n}\Big)
\]
where $c_{X,Y} = c(1 + b_{X,2}^2 + b_{X,2} b_Y + b_{Y,\psi_1}^2 + b_{Y,\infty}^2 + b_{Y,2}^2)$. This is almost the penalty
we want, up to the $\log\log r$ term, so we consider
\[
\rho_{n,3}(r, x) = c_{X,Y}\, \frac{(r + 1)\log n}{\sqrt n}\Big(\frac{1}{r_1} + \frac{(x + \log n)\log n}{r_2 \sqrt n}\Big).
\]
Let us introduce for short
\[
r(A) := r_1\|A\|_{S_1} + r_2\|A\|_{S_2}^2 = \inf(r \ge 0 : A \in B_r)
\]
and the following functionals:
\[
\begin{aligned}
\Lambda_3(A) &= R(A) + \mathrm{pen}_3(A), & \Lambda_{n,3}(A) &= R_n(A) + \mathrm{pen}_3(A), \\
\tilde\Lambda_3(A) &= R(A) + \widetilde{\mathrm{pen}}_3(A), & \tilde\Lambda_{n,3}(A) &= R_n(A) + \widetilde{\mathrm{pen}}_3(A),
\end{aligned}
\]
where $\mathrm{pen}_3(A) := \rho_{n,3}(r(A), x)$ and where $\widetilde{\mathrm{pen}}_3(A) := \tilde\rho_{n,3}(r(A), x)$. We only need to
prove that
\[
\inf_A \tilde\Lambda_3(A) \le \inf_A \Lambda_3(A) \quad \text{and} \qquad (34)
\]
\[
\operatorname{argmin}_A \Lambda_{n,3}(A) \subset \operatorname{argmin}_A \tilde\Lambda_{n,3}(A). \qquad (35)
\]
Obviously, if $\tilde\rho_{n,3}(r, x) > \rho_{n,3}(r, x)$, then $r > e^n$, so following the arguments we used
for the $S_1$ penalty, it is easy to prove both (34) and (35). This concludes the proof
of Theorem 2.

3.7.4 The $\|\cdot\|_{S_1} + \|\cdot\|_{S_2}^2 + \|\cdot\|_1$ case

Recall that in this case
\[
\lambda(r) = c\Big(\frac{b_{X,2}^2 r (\log n)^2}{r_2 n} + \Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3}\Big)\frac{b_{X,2} b_Y r (\log n)^{3/2}}{\sqrt n}\Big)
\]
and that
\[
C_r = \min\Big(b_{X,\infty}\frac{r}{r_1},\ b_{X,2}\sqrt{\frac{r}{r_2}},\ b_{X,\ell_\infty}\frac{r}{r_3}\Big) \le b_{X,2}\sqrt{\frac{r}{r_2}},
\]
see (16). An easy computation gives that $\rho'_n(r, x) \le \tilde\rho_{n,4}(r, x)$, where
\[
\tilde\rho_{n,4}(r, x) := c_{X,Y}\, \frac{(r + 1)(\log n)^{3/2}}{\sqrt n}\Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3} + \frac{x + \log n \vee \log\log r}{r_2 \sqrt n}\Big) \qquad (36)
\]
where $c_{X,Y} = c(1 + b_{X,2}^2 + b_{X,2} b_Y + b_{Y,\psi_1}^2 + b_{Y,\infty}^2 + b_{Y,2}^2)$. The penalization we want is
\[
\rho_{n,4}(r, x) := c_{X,Y}\, \frac{(r + 1)(\log n)^{3/2}}{\sqrt n}\Big(\frac{1}{r_1} \wedge \frac{\sqrt{\log(mT)}}{r_3} + \frac{x + \log n}{r_2 \sqrt n}\Big)
\]
so introducing $r(A) = r_1\|A\|_{S_1} + r_2\|A\|_{S_2}^2 + r_3\|A\|_1$ and following the lines of the
proof of the $S_1 + S_2^2$ case to remove the $\log\log r$ term, it is easy to conclude the proof
of Theorem 4.